Data extractor

DX provides a self-hosted data extractor for customers who need to keep their API credentials within their own networks or who cannot allowlist API requests from DX. The self-hosted data extractor connects to on-premises instances such as GitLab and Jira and pushes metadata to DX servers.

DX data extractors are provided as a Docker image (please contact your DX account manager for access). You run a separate instance (e.g., a Kubernetes pod) of the data extractor for each data source you want to import from. For example, to connect to both GitLab and Jira, you would set up two data extractor instances, each configured with environment variables specific to its data source.

Requirements

Each data extractor instance requires 2 GiB of memory and 1 vCPU, so the total infrastructure requirement scales with the number of instances you deploy. Each instance should run in the same security context as the systems it will make API requests to (e.g., GitLab or Jira). Each instance also needs to be able to make outbound requests to your Data Cloud instance.
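
As an illustration, the per-instance sizing above could be expressed as Kubernetes resource settings on the extractor container. This is a minimal sketch, not an official DX template: setting limits equal to requests is an assumption made here for predictability, and you may prefer different limits.

```yaml
# Fragment of a pod template: one data extractor container sized at
# 2 GiB of memory and 1 vCPU, matching the stated requirements.
containers:
- name: dx-extractor
  image: ghcr.io/get-dx/extractor:latest
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:            # assumption: limits mirror requests; adjust as needed
      memory: "2Gi"
      cpu: "1"
```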

If you have expertise with Kubernetes, the recommended method of deployment is a managed service such as GKE, EKS, or AKS. Follow the steps below:

  1. Create a new Kubernetes cluster.
  2. Set up logging so logs can be easily retrieved for support.
  3. Make a copy of the appropriate Deployment YAML template below and set up secrets.
  4. Run kubectl apply to create the Deployment.
  5. Tail the logs to verify successful execution.
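
For the "set up secrets" step, one possible shape of the Secret is sketched below, using the GitHub data source as the example. The Secret name and keys mirror the secretKeyRef entries in the GitHub Deployment template in this document; the bracketed values are placeholders to replace with your own, and storing secrets as a manifest file (rather than through a secrets manager) is an assumption made for brevity.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: github-connector-secrets   # must match secretKeyRef.name in the Deployment
type: Opaque
stringData:                        # written as plain text; Kubernetes stores it base64-encoded
  DATACLOUD_URL: "https://yourinstance.getdx.net"
  DATACLOUD_KEY: "<your Data Cloud API key>"
  GITHUB_APP_ID: "<your GitHub App ID>"
  GITHUB_PEM_64: "<Base64-encoded content of your PEM file>"
```

Create the Secret with kubectl apply before creating the Deployment, so that the pod can resolve every referenced key when it starts.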

DX monitors data imports to ensure that the data extractor is running successfully. If you would like additional system monitoring of the data extractor itself, we recommend watching for signals that indicate process failures or crashes, such as error log output or an increasing pod restartCount.
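
If your cluster happens to run the Prometheus Operator together with kube-state-metrics (an assumption; neither is required by DX), restart monitoring might look like the sketch below. The rule and alert names are hypothetical, and the 15-minute window is an arbitrary choice.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dx-extractor-restarts          # hypothetical name
spec:
  groups:
  - name: dx-extractor
    rules:
    - alert: DxExtractorRestarting     # hypothetical alert name
      # True when the dx-extractor container's restart count has
      # increased during the last 15 minutes.
      expr: increase(kube_pod_container_status_restarts_total{container="dx-extractor"}[15m]) > 0
      labels:
        severity: warning
      annotations:
        summary: "DX data extractor container is restarting"
```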

The remainder of this document outlines the environment variables and Kubernetes YAML templates for each supported data source. Remember that you’ll set up a separate instance of the data extractor for each data source you want to import from.

GitHub

Required environment variables

Name                  Description
EXTRACTION_TYPE       Must be set to github. Example: github
DATACLOUD_URL         Your Data Cloud instance URL. Example: https://yourinstance.getdx.net
DATACLOUD_KEY         Data Cloud API key. Example: mPB5sf6w3JahSLMherWp8B7nTps13FKY
GITHUB_URL            API base URL of your GitHub instance. Example: https://github.myteam.com/api/v3/
GITHUB_APP_ID         GitHub App ID. Example: 320840
GITHUB_PEM_64         Base64-encoded content of your PEM file.
EXTRACTOR_PROXY_URL   Optional. Proxy URL; the proxy acts as middleware that forwards API requests to Data Cloud. Example: proxy.getdx.net
EXTRACTOR_PROXY_PORT  Proxy port. Example: 80
EXTRACTOR_PROXY_USER  Proxy username. Example: dxuser
EXTRACTOR_PROXY_PASS  Proxy password.

Kubernetes deployment YAML template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-github
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-github
  template:
    metadata:
      labels:
        app: dx-extractor-github
    spec:
      containers:
      - name: dx-extractor
        image: ghcr.io/get-dx/extractor:latest
        env:
        - name: DATACLOUD_URL
          valueFrom:
            secretKeyRef:
              name: github-connector-secrets
              key: DATACLOUD_URL
        - name: DATACLOUD_KEY
          valueFrom:
            secretKeyRef:
              name: github-connector-secrets
              key: DATACLOUD_KEY
        - name: EXTRACTION_TYPE
          value: "github"
        - name: GITHUB_PEM_64
          valueFrom:
            secretKeyRef:
              name: github-connector-secrets
              key: GITHUB_PEM_64
        - name: GITHUB_URL
          value: "https://api.github.com"
        - name: GITHUB_APP_ID
          valueFrom:
            secretKeyRef:
              name: github-connector-secrets
              key: GITHUB_APP_ID
        - name: LOG_LEVEL
          value: "DEBUG"
        - name: LOG_FORMAT
          value: "json"
      restartPolicy: Always
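
The template above does not set the proxy variables. If you route requests through a proxy, entries like the following could be appended to the container's env list. This is a sketch: the values are the illustrative examples from the table, and the EXTRACTOR_PROXY_USER and EXTRACTOR_PROXY_PASS keys in the Secret are assumptions that you would need to add yourself.

```yaml
# Extra entries for the dx-extractor container's env list.
env:
- name: EXTRACTOR_PROXY_URL
  value: "proxy.getdx.net"
- name: EXTRACTOR_PROXY_PORT
  value: "80"                          # env values must be strings, so quote the port
- name: EXTRACTOR_PROXY_USER
  valueFrom:
    secretKeyRef:
      name: github-connector-secrets
      key: EXTRACTOR_PROXY_USER        # assumed key, not in the original template
- name: EXTRACTOR_PROXY_PASS
  valueFrom:
    secretKeyRef:
      name: github-connector-secrets
      key: EXTRACTOR_PROXY_PASS        # assumed key, not in the original template
```

The same entries apply to the other data sources; only the Secret name changes.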

GitLab

Required environment variables

Name                  Description
EXTRACTION_TYPE       Must be set to gitlab. Example: gitlab
DATACLOUD_URL         Your Data Cloud instance URL. Example: https://yourinstance.getdx.net
DATACLOUD_KEY         Data Cloud API key. Example: mPB5sf6w3JahSLMherWp8B7nTps13FKY
GITLAB_URL            API base URL of your GitLab instance. Example: https://gitlab.com/
GITLAB_API_TOKEN      GitLab API token. Example: glpat-31RAZpMWxzX_m9BBnLyY
EXTRACTOR_PROXY_URL   Optional. Proxy URL used to forward API requests to Data Cloud. Example: proxy.getdx.net
EXTRACTOR_PROXY_PORT  Proxy port. Example: 80
EXTRACTOR_PROXY_USER  Proxy username. Example: dxuser
EXTRACTOR_PROXY_PASS  Proxy password.

Kubernetes deployment YAML template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-gitlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-gitlab
  template:
    metadata:
      labels:
        app: dx-extractor-gitlab
    spec:
      containers:
      - name: dx-extractor
        image: ghcr.io/get-dx/extractor:latest
        env:
        - name: DATACLOUD_URL
          valueFrom:
            secretKeyRef:
              name: gitlab-connector-secrets
              key: DATACLOUD_URL
        - name: DATACLOUD_KEY
          valueFrom:
            secretKeyRef:
              name: gitlab-connector-secrets
              key: DATACLOUD_KEY
        - name: EXTRACTION_TYPE
          value: "gitlab"
        - name: GITLAB_URL
          value: "https://gitlab.com/"
        - name: GITLAB_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: gitlab-connector-secrets
              key: GITLAB_API_TOKEN
        - name: LOG_LEVEL
          value: "DEBUG"
        - name: LOG_FORMAT
          value: "json"
      restartPolicy: Always

Jira Data Center

Required environment variables

Name                  Description
EXTRACTION_TYPE       Must be set to jira_data_center. Example: jira_data_center
DATACLOUD_URL         Your Data Cloud instance URL. Example: https://yourinstance.getdx.net
DATACLOUD_KEY         Data Cloud API key. Example: mPB5sf6w3JahSLMherWp8B7nTps13FKY
JIRA_URL              API base URL of your Jira Data Center instance. Example: https://jira.somehost.net/rest/api/2/
JIRA_API_TOKEN        Personal Access Token (PAT) for your Jira service account. Example: mPB5sf6w3JahSLMherWp8B7nTps13FKY
JIRA_USERNAME         Username of your Jira service account (if using basic auth). Example: dxuser
JIRA_PASSWORD         Password of your Jira service account (if using basic auth). Example: password
EXTRACTOR_PROXY_URL   Optional. Proxy URL used to forward API requests to Data Cloud. Example: proxy.getdx.net
EXTRACTOR_PROXY_PORT  Proxy port. Example: 80
EXTRACTOR_PROXY_USER  Proxy username. Example: dxuser
EXTRACTOR_PROXY_PASS  Proxy password.

Kubernetes deployment YAML template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-jira
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-jira
  template:
    metadata:
      labels:
        app: dx-extractor-jira
    spec:
      containers:
      - name: dx-extractor
        image: ghcr.io/get-dx/extractor:latest
        env:
        - name: DATACLOUD_URL
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: DATACLOUD_URL
        - name: DATACLOUD_KEY
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: DATACLOUD_KEY
        - name: EXTRACTION_TYPE
          value: "jira_data_center"
        - name: JIRA_URL
          value: "https://jira.somehost.net/rest/api/2/"
        - name: JIRA_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: JIRA_API_TOKEN
        - name: JIRA_USERNAME    # Required for basic auth only
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: JIRA_USERNAME
        - name: JIRA_PASSWORD    # Required for basic auth only
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: JIRA_PASSWORD
        - name: LOG_LEVEL
          value: "DEBUG"
        - name: LOG_FORMAT
          value: "json"
      restartPolicy: Always
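
The template above references both the PAT and the basic-auth credentials. In Kubernetes, a secretKeyRef that points at a key missing from the Secret prevents the container from starting unless the reference is marked optional. If dx-secrets holds only a PAT, one option is to delete the two basic-auth entries; another, sketched below, is to mark them optional. This assumes the extractor treats unset basic-auth variables as "not in use", which this document does not confirm.

```yaml
# Credential entries for the container's env list, PAT-only setup.
env:
- name: JIRA_API_TOKEN
  valueFrom:
    secretKeyRef:
      name: dx-secrets
      key: JIRA_API_TOKEN
- name: JIRA_USERNAME
  valueFrom:
    secretKeyRef:
      name: dx-secrets
      key: JIRA_USERNAME
      optional: true     # variable is left unset if the key is absent
- name: JIRA_PASSWORD
  valueFrom:
    secretKeyRef:
      name: dx-secrets
      key: JIRA_PASSWORD
      optional: true     # variable is left unset if the key is absent
```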

Bitbucket Data Center

Required environment variables

Name                              Description
EXTRACTION_TYPE                   Must be set to bitbucket_data_center. Example: bitbucket_data_center
DATACLOUD_URL                     Your Data Cloud instance URL. Example: https://yourinstance.getdx.net
DATACLOUD_KEY                     Data Cloud API key. Example: mPB5sf6w3JahSLMherWp8B7nTps13FKY
BITBUCKET_URL                     API base URL of your Bitbucket Data Center instance. Example: https://bitbucket.somehost.net
BITBUCKET_USERNAME                Username of your Bitbucket service account (if using basic auth). Example: dxuser
BITBUCKET_PASSWORD                Password of your Bitbucket service account (if using basic auth). Example: password
BITBUCKET_API_KEY                 API key of your Bitbucket service account (if not using basic auth). Example: api_key
EXTRACTOR_PROXY_URL               Optional. Proxy URL used to forward API requests to Data Cloud. Example: proxy.getdx.net
EXTRACTOR_PROXY_PORT              Proxy port. Example: 80
EXTRACTOR_PROXY_USER              Proxy username. Example: dxuser
EXTRACTOR_PROXY_PASS              Proxy password.
BITBUCKET_PROJECT_KEYS_ALLOWLIST  Optional. Comma-delimited list of project keys for DX to import. Example: PROJ1,PROJ2

Kubernetes deployment YAML template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-bitbucket
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-bitbucket
  template:
    metadata:
      labels:
        app: dx-extractor-bitbucket
    spec:
      containers:
      - name: dx-extractor
        image: ghcr.io/get-dx/extractor:latest
        env:
        - name: DATACLOUD_URL
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: DATACLOUD_URL
        - name: DATACLOUD_KEY
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: DATACLOUD_KEY
        - name: EXTRACTION_TYPE
          value: "bitbucket_data_center"
        - name: BITBUCKET_URL
          value: "https://bitbucket.somehost.net"
        - name: BITBUCKET_API_KEY
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: BITBUCKET_API_KEY
        - name: BITBUCKET_USERNAME    # Required for basic auth only
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: BITBUCKET_USERNAME
        - name: BITBUCKET_PASSWORD    # Required for basic auth only
          valueFrom:
            secretKeyRef:
              name: dx-secrets
              key: BITBUCKET_PASSWORD
        - name: LOG_LEVEL
          value: "DEBUG"
        - name: LOG_FORMAT
          value: "json"
      restartPolicy: Always
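
The template above imports from every project the service account can see. To restrict the import, the optional allowlist variable from the table could be added to the container's env list, as in this sketch; the project keys are the illustrative values from the table, not real keys.

```yaml
# Extra entry for the dx-extractor container's env list.
env:
- name: BITBUCKET_PROJECT_KEYS_ALLOWLIST
  value: "PROJ1,PROJ2"   # comma-delimited project keys to import
```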