---
title: "Data extractor"
canonical_url: "https://docs.getdx.com/data-extractor/"
md_url: "https://docs.getdx.com/data-extractor.md"
last_updated: "2026-06-18"
---

# Data extractor
DX offers a self-hosted Data Extractor for customers who need to keep API credentials within their network or cannot allowlist incoming requests from DX. It connects to on-prem tools like GitLab and Jira, and pushes metadata to your Data Cloud database.

The Extractor is distributed as a Docker image. You’ll run a separate instance (e.g., a K8s pod) for each data source. For example, to connect both GitLab and Jira, you would deploy two Extractor instances, each configured with environment variables for its respective tool.


> If you deploy multiple Extractor instances of the **same type** (e.g., two GitHub extractors for two separate GitHub instances), you must set the `EXTRACTOR_ID` environment variable to a unique value on each instance so they don’t conflict. This is not needed when each instance is a different type.
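
As a minimal sketch of this note, the two `env` fragments below belong to two separate GitHub extractor Deployments that differ only in their target instance and `EXTRACTOR_ID`. The ID values and the second GitHub URL are hypothetical placeholders; any values work as long as each instance gets a different one.

```yaml
# Deployment A: env fragment for the first GitHub instance
- name: EXTRACTION_TYPE
  value: "github"
- name: GITHUB_URL
  value: "https://github.myteam.com/api/v3/"
- name: EXTRACTOR_ID
  value: "github-primary"        # hypothetical ID, must be unique
---
# Deployment B: env fragment for the second GitHub instance
- name: EXTRACTION_TYPE
  value: "github"
- name: GITHUB_URL
  value: "https://github.example.internal/api/v3/"   # hypothetical second instance
- name: EXTRACTOR_ID
  value: "github-secondary"      # hypothetical ID, must differ from Deployment A
```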


Docker images are distributed via GitHub Package Registry:

```
https://github.com/orgs/get-dx/packages/container/package/extractor
```

## Requirements

- 2 GiB RAM and 1 vCPU per extractor instance
- Each instance must:
  - Run in the same security context as the data source
  - Have outbound access to `https://yourinstance.getdx.net`
- DX Data Cloud credentials: create an API key on the [Data Cloud API keys page](https://app.getdx.com/datacloud/api_keys)
- Tokens and credentials for each data source:
  - [GitHub](https://docs.getdx.com/connectors/github/)
  - [GitLab](https://docs.getdx.com/connectors/gitlab/)
  - [Bitbucket Data Center](https://docs.getdx.com/connectors/bitbucket-data-center/)
  - [Jira Data Center](https://docs.getdx.com/connectors/jira-data-center/)
  - [Azure DevOps Server](https://docs.getdx.com/connectors/azure-devops/)
  - Perforce (Helix Core) service account credentials
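
The sizing requirement above can be expressed as Kubernetes resource requests on the extractor container. This is a sketch rather than an official DX recommendation: setting limits equal to requests is an assumption made here for predictable scheduling, and you may prefer to leave limits off.

```yaml
# Fragment of a Deployment's pod spec (spec.template.spec)
containers:
  - name: dx-extractor
    image: ghcr.io/get-dx/extractor:latest
    resources:
      requests:
        memory: "2Gi"   # 2 GiB RAM per extractor instance
        cpu: "1"        # 1 vCPU per extractor instance
      limits:
        memory: "2Gi"   # assumption: limit equals request
        cpu: "1"
```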

## Deployment

**Recommended method: Kubernetes (GKE, EKS, AKS)**

1. Create a new Kubernetes cluster
2. Set up logging for support/debugging
3. Copy and customize the appropriate deployment YAML (see below)
4. Run `kubectl apply` to deploy
5. Use `kubectl logs` to verify startup


> Enable logging from the start to simplify debugging and support.
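
The deployment templates on this page read credentials from a Kubernetes Secret through `secretKeyRef`. As a minimal sketch, a Secret for the GitHub template could look like the following. The Secret name and keys match that template; every value is a placeholder to replace with your own.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: github-connector-secrets   # must match secretKeyRef.name in the Deployment
type: Opaque
stringData:                        # plain-text input; Kubernetes stores it base64-encoded
  DATACLOUD_URL: "https://yourinstance.getdx.net"
  DATACLOUD_KEY: "<your Data Cloud API key>"
  GITHUB_APP_ID: "<your GitHub App ID>"
  GITHUB_PEM_64: "<base64-encoded contents of your PEM file>"
```

Apply the Secret with `kubectl apply -f` before applying the Deployment, and keep the filled-in file out of version control.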


## Monitoring

DX monitors import success. For additional monitoring:

- Check logs for crashes or failed imports
- Monitor pod `restartCount`
- Alert on log patterns
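
One way to monitor restarts is an alerting rule on container restart counts. The sketch below assumes your cluster already runs Prometheus with kube-state-metrics, neither of which DX requires. The alert name, threshold, and time windows are hypothetical and should be tuned to your environment.

```yaml
# Prometheus alerting rule file (assumes kube-state-metrics is installed)
groups:
  - name: dx-extractor
    rules:
      - alert: DxExtractorRestarting          # hypothetical alert name
        # Fires when the extractor container restarted more than 3 times in 1 hour
        expr: increase(kube_pod_container_status_restarts_total{container="dx-extractor"}[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DX extractor container is restarting repeatedly"
```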

## YAML Templates

### GitHub

#### Required environment variables

| Name                              | Description                                                                                                                                                                                            |
| :-------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE                   | Must be set to **github**<br><br>Example:<br>`github`                                                                                                                                                  |
| DATACLOUD_URL                     | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                                                                                                                      |
| DATACLOUD_KEY                     | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                                                                                                              |
| GITHUB_URL                        | API base URL of your GitHub instance.<br><br>Example:<br>`https://github.myteam.com/api/v3/`                                                                                                           |
| GITHUB_APP_ID                     | GitHub App ID<br><br>Example:<br>`320840`                                                                                                                                                              |
| GITHUB_PEM_64                     | Base64 encoded content of your PEM file.                                                                                                                                                               |
| EXTRACTOR_PROXY_URL               | Optional. Proxy URL. The proxy acts as middleware that forwards API requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net` |
| EXTRACTOR_PROXY_PORT              | Optional. Proxy port.<br><br>Example:<br>`80` |
| EXTRACTOR_PROXY_USER              | Optional. Proxy username.<br><br>Example:<br>`dxuser` |
| EXTRACTOR_PROXY_PASS              | Optional. Proxy password. |
| GITHUB_EXTRACT_PULL_COMMITS       | Optional. Enhanced GitHub extraction that pulls commits for each pull request.<br><br>Example:<br>`true`                                                                                               |
| GITHUB_EXTRACT_TRUNK_COMMITS      | Optional. Extract commits from the default branch of each repository—useful for teams using trunk-based development. Requires `contents:read` permission on your GitHub App.<br><br>Example:<br>`true` |
| GITHUB_EXTRACT_ONLY_PRIVATE_REPOS | Optional. Only extract private repositories (by default, both public and private repositories are extracted).<br><br>Example:<br>`true`                                                                |
| GITHUB_EXTRACT_FORKED_REPOS       | Optional. Extract forked repositories (by default, forked repositories are skipped).<br><br>Example:<br>`true`                                                                                         |
| GITHUB_EXTRACT_ARCHIVED_REPOS     | Optional. Extract archived repositories (by default, archived repositories are skipped).<br><br>Example:<br>`true`                                                                                     |

#### Kubernetes deployment YAML template (GitHub)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-github
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-github
  template:
    metadata:
      labels:
        app: dx-extractor-github
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: github-connector-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: github-connector-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "github"
            - name: GITHUB_PEM_64
              valueFrom:
                secretKeyRef:
                  name: github-connector-secrets
                  key: GITHUB_PEM_64
            - name: GITHUB_URL
              value: "https://api.github.com"
            - name: GITHUB_APP_ID
              valueFrom:
                secretKeyRef:
                  name: github-connector-secrets
                  key: GITHUB_APP_ID
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```
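
The optional GitHub flags from the table can be added to the same container `env` list. The fragment below is a sketch showing three of them; which flags you enable depends on your setup. Kubernetes requires `env` values to be strings, so `true` is quoted.

```yaml
# Additional entries for the dx-extractor container's env list
- name: GITHUB_EXTRACT_PULL_COMMITS
  value: "true"    # pull commits for each pull request
- name: GITHUB_EXTRACT_TRUNK_COMMITS
  value: "true"    # requires contents:read permission on the GitHub App
- name: GITHUB_EXTRACT_ARCHIVED_REPOS
  value: "true"    # archived repositories are skipped by default
```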

### GitLab

#### Required environment variables

| Name                             | Description                                                                                                |
| :------------------------------- | :--------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE                  | Must be set to **gitlab**<br><br>Example:<br>`gitlab`                                                      |
| DATACLOUD_URL                    | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                          |
| DATACLOUD_KEY                    | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                  |
| GITLAB_URL                       | API base URL of your GitLab instance.<br><br>Example:<br>`https://gitlab.com/`                             |
| GITLAB_API_TOKEN                 | GitLab API token, such as a personal access token.<br><br>Example:<br>`glpat-31RAZpMWxzX_m9BBnLyY` |
| EXTRACTOR_PROXY_URL              | Optional. Proxy URL used to forward API requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net` |
| EXTRACTOR_PROXY_PORT             | Optional. Proxy port.<br><br>Example:<br>`80` |
| EXTRACTOR_PROXY_USER             | Optional. Proxy username.<br><br>Example:<br>`dxuser` |
| EXTRACTOR_PROXY_PASS             | Optional. Proxy password. |
| GITLAB_EXTRACT_FORKED_PROJECTS   | Optional. Extract forked projects (by default, forked projects are skipped).<br><br>Example:<br>`true`     |
| GITLAB_EXTRACT_ARCHIVED_PROJECTS | Optional. Extract archived projects (by default, archived projects are skipped).<br><br>Example:<br>`true` |

#### Kubernetes deployment YAML template (GitLab)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-gitlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-gitlab
  template:
    metadata:
      labels:
        app: dx-extractor-gitlab
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: gitlab-connector-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: gitlab-connector-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "gitlab"
            - name: GITLAB_URL
              value: "https://gitlab.com/"
            - name: GITLAB_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: gitlab-connector-secrets   # same Secret as the other GitLab keys
                  key: GITLAB_API_TOKEN
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```
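
If the extractor must reach Data Cloud through a proxy, the `EXTRACTOR_PROXY_*` variables go in the same `env` list. The fragment below is a sketch that assumes you have added `EXTRACTOR_PROXY_USER` and `EXTRACTOR_PROXY_PASS` keys to the `gitlab-connector-secrets` Secret; the host and port are the example values from the table.

```yaml
# Additional entries for the dx-extractor container's env list
- name: EXTRACTOR_PROXY_URL
  value: "proxy.getdx.net"
- name: EXTRACTOR_PROXY_PORT
  value: "80"                     # quoted: env values must be strings
- name: EXTRACTOR_PROXY_USER
  valueFrom:
    secretKeyRef:
      name: gitlab-connector-secrets
      key: EXTRACTOR_PROXY_USER   # assumption: key added to the Secret
- name: EXTRACTOR_PROXY_PASS
  valueFrom:
    secretKeyRef:
      name: gitlab-connector-secrets
      key: EXTRACTOR_PROXY_PASS   # assumption: key added to the Secret
```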

### Bitbucket Data Center

#### Required environment variables

| Name                             | Description                                                                                                                                                                                                                                                                                         |
| :------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE                  | Must be set to **bitbucket_data_center**<br><br>Example:<br>`bitbucket_data_center` |
| DATACLOUD_URL                    | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                                                                                                                                                                                                                   |
| DATACLOUD_KEY                    | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                                                                                                                                                                                                           |
| BITBUCKET_URL                    | API base URL of your Bitbucket Data Center instance.<br><br>Example:<br>`https://bitbucket.somehost.net`                                                                                                                                                                                            |
| BITBUCKET_USERNAME               | Username of your Bitbucket service account (if using basic auth).<br><br>Example:<br>`dxuser`                                                                                                                                                                                                       |
| BITBUCKET_PASSWORD               | Password of your Bitbucket service account (if using basic auth).<br><br>Example:<br>`password`                                                                                                                                                                                                     |
| BITBUCKET_API_KEY                | API key of your Bitbucket service account **if not using Basic Auth**.<br><br>Example:<br>`api_key` |
| EXTRACTOR_PROXY_URL              | Optional. Proxy URL used to forward API requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net` |
| EXTRACTOR_PROXY_PORT             | Optional. Proxy port.<br><br>Example:<br>`80` |
| EXTRACTOR_PROXY_USER             | Optional. Proxy username.<br><br>Example:<br>`dxuser` |
| EXTRACTOR_PROXY_PASS             | Optional. Proxy password. |
| BITBUCKET_PROJECT_KEYS_ALLOWLIST | Optional. Comma-delimited list of project keys for DX to import.<br><br>Example:<br>`PROJ1,PROJ2` |
| BITBUCKET_IMPORT_COMMITS         | Optional. Set to **`true`** to enable commit data ingestion from Bitbucket Data Center. Your Data Cloud environment must have the Bitbucket Data Center commits schema applied for commits to be stored; see [conditional schemas](https://docs.getdx.com/administration/conditional-schemas/).<br><br>Example:<br>`true` |

#### Kubernetes deployment YAML template (Bitbucket Data Center)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-bitbucket
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-bitbucket
  template:
    metadata:
      labels:
        app: dx-extractor-bitbucket
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "bitbucket_data_center"
            - name: BITBUCKET_URL
              value: "https://bitbucket.somehost.net"
            - name: BITBUCKET_API_KEY     # Omit when using basic auth
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: BITBUCKET_API_KEY
            - name: BITBUCKET_USERNAME    # Required for basic auth only
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: BITBUCKET_USERNAME
            - name: BITBUCKET_PASSWORD    # Required for basic auth only
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: BITBUCKET_PASSWORD
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```
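
To limit the import to specific projects or to enable commit ingestion, the two optional Bitbucket variables can be appended to the container `env` list. This is a sketch using the example project keys from the table, which you would replace with your own.

```yaml
# Additional entries for the dx-extractor container's env list
- name: BITBUCKET_PROJECT_KEYS_ALLOWLIST
  value: "PROJ1,PROJ2"   # comma-delimited project keys (example values)
- name: BITBUCKET_IMPORT_COMMITS
  value: "true"          # needs the Bitbucket Data Center commits schema in Data Cloud
```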

### Jira Data Center

#### Required environment variables

| Name                 | Description                                                                                                      |
| :------------------- | :--------------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE      | Must be set to **jira_data_center**<br><br>Example:<br>`jira_data_center`                                        |
| DATACLOUD_URL        | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                                |
| DATACLOUD_KEY        | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                        |
| JIRA_URL             | API base URL of your Jira Data Center instance.<br><br>Example:<br>`https://jira.somehost.net/rest/api/2/`       |
| JIRA_API_TOKEN       | Personal Access Token (PAT) for your Jira service account.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY` |
| JIRA_USERNAME        | Username of your Jira service account (if using basic auth).<br><br>Example:<br>`dxuser`                         |
| JIRA_PASSWORD        | Password of your Jira service account (if using basic auth).<br><br>Example:<br>`password`                       |
| EXTRACTOR_PROXY_URL  | Optional. Proxy URL used to forward API requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net` |
| EXTRACTOR_PROXY_PORT | Optional. Proxy port.<br><br>Example:<br>`80` |
| EXTRACTOR_PROXY_USER | Optional. Proxy username.<br><br>Example:<br>`dxuser` |
| EXTRACTOR_PROXY_PASS | Optional. Proxy password. |

**User Linking**

Unlike other Jira integrations, the Jira extractor does not extract user data on its own. Instead, as Jira issues are imported, DX reads each issue's creator and assignee and creates or updates the corresponding Jira user record in the database. As a result, user data may sync with a delay, and some Jira usernames may remain unlinked.

#### Kubernetes deployment YAML template (Jira Data Center)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-jira
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-jira
  template:
    metadata:
      labels:
        app: dx-extractor-jira
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "jira_data_center"
            - name: JIRA_URL
              value: "https://jira.somehost.net/rest/api/2/"
            - name: JIRA_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: JIRA_API_TOKEN
            - name: JIRA_USERNAME    # Required for basic auth only
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: JIRA_USERNAME
            - name: JIRA_PASSWORD    # Required for basic auth only
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: JIRA_PASSWORD
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```

### Azure DevOps (ADO) Server

#### Required environment variables

| Name                             | Description                                                                                                                                                                                                                             |
| :------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE                  | Must be set to **ado_server**<br><br>Example:<br>`ado_server`                                                                                                                                                                           |
| DATACLOUD_URL                    | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                                                                                                                                                       |
| DATACLOUD_KEY                    | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                                                                                                                                               |
| ADO_SERVER_BASE_URL              | Base URL of your Azure DevOps Server instance.<br><br>Example:<br>`https://devops.mycompany.com`                                                                                                                                        |
| ADO_SERVER_ORGANIZATION_NAME     | The organization or collection name in Azure DevOps Server.<br><br>Example:<br>`DefaultCollection`                                                                                                                                      |
| ADO_SERVER_PERSONAL_ACCESS_TOKEN | Personal Access Token for authenticating with Azure DevOps Server.<br><br>Example:<br>`your-ado-personal-access-token`                                                                                                                  |
| ADO_SERVER_CONNECTOR_TYPE        | Type of data to extract. Must be **repos**, **boards**, or **pipelines**.<br>**repos**: Extracts repository data<br>**boards**: Extracts work item/board data<br>**pipelines**: Extracts pipeline/build data<br><br>Example:<br>`repos` |
| EXTRACTOR_PROXY_URL              | Optional. Proxy URL used to forward API requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net` |
| EXTRACTOR_PROXY_PORT             | Optional. Proxy port.<br><br>Example:<br>`80` |
| EXTRACTOR_PROXY_USER             | Optional. Proxy username.<br><br>Example:<br>`dxuser` |
| EXTRACTOR_PROXY_PASS             | Optional. Proxy password. |
| EXTRACTOR_ID                     | Unique identifier for the extractor instance. Required when running multiple instances (repos, boards, pipelines). Can be any random unique ID.<br><br>Example:<br>`102`                                                                |


> To extract repos, boards, and pipelines data from Azure DevOps Server, run three separate extractor instances: one with `ADO_SERVER_CONNECTOR_TYPE=repos`, one with `ADO_SERVER_CONNECTOR_TYPE=boards`, and one with `ADO_SERVER_CONNECTOR_TYPE=pipelines`. Because all three share the same `EXTRACTION_TYPE`, give each instance a unique `EXTRACTOR_ID`.


#### Docker Compose template (ADO Server)

```yaml
services:
  extractor:
    image: ghcr.io/get-dx/extractor:latest
    environment:
      DATACLOUD_URL: "https://yourinstance.getdx.net"
      DATACLOUD_KEY: "mPB5sf6w3JahSLMherWp8B7nTps13FKY"
      EXTRACTION_TYPE: "ado_server"
      ADO_SERVER_BASE_URL: "https://devops.mycompany.com"
      ADO_SERVER_ORGANIZATION_NAME: "DefaultCollection"
      ADO_SERVER_PERSONAL_ACCESS_TOKEN: "your-personal-access-token"
      ADO_SERVER_CONNECTOR_TYPE: "repos" # Use "repos", "boards", or "pipelines"
      EXTRACTOR_ID: "102"
      LOG_LEVEL: "DEBUG"
      LOG_FORMAT: "json"
    restart: always
```
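
To run all three ADO connector types from one Compose file, each service needs its own `ADO_SERVER_CONNECTOR_TYPE` and a unique `EXTRACTOR_ID`. The sketch below shares common settings through YAML anchors and reads secrets from the shell environment with Compose variable interpolation. The service names, the `x-ado-*` extension fields, and the `EXTRACTOR_ID` values are hypothetical.

```yaml
# Shared settings, defined once and merged into each service below
x-ado-common: &ado-common
  image: ghcr.io/get-dx/extractor:latest   # tag used by the Kubernetes templates on this page
  restart: always

x-ado-env: &ado-env
  DATACLOUD_URL: "https://yourinstance.getdx.net"
  DATACLOUD_KEY: "${DATACLOUD_KEY}"        # read from the shell or an .env file
  EXTRACTION_TYPE: "ado_server"
  ADO_SERVER_BASE_URL: "https://devops.mycompany.com"
  ADO_SERVER_ORGANIZATION_NAME: "DefaultCollection"
  ADO_SERVER_PERSONAL_ACCESS_TOKEN: "${ADO_SERVER_PERSONAL_ACCESS_TOKEN}"
  LOG_FORMAT: "json"

services:
  extractor-repos:
    <<: *ado-common
    environment:
      <<: *ado-env
      ADO_SERVER_CONNECTOR_TYPE: "repos"
      EXTRACTOR_ID: "ado-repos"            # hypothetical, must be unique
  extractor-boards:
    <<: *ado-common
    environment:
      <<: *ado-env
      ADO_SERVER_CONNECTOR_TYPE: "boards"
      EXTRACTOR_ID: "ado-boards"           # hypothetical, must be unique
  extractor-pipelines:
    <<: *ado-common
    environment:
      <<: *ado-env
      ADO_SERVER_CONNECTOR_TYPE: "pipelines"
      EXTRACTOR_ID: "ado-pipelines"        # hypothetical, must be unique
```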

#### Kubernetes deployment YAML template (ADO Server - Repos)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-ado-repos
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-ado-repos
  template:
    metadata:
      labels:
        app: dx-extractor-ado-repos
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "ado_server"
            - name: ADO_SERVER_BASE_URL
              value: "https://devops.mycompany.com"
            - name: ADO_SERVER_ORGANIZATION_NAME
              value: "DefaultCollection"
            - name: ADO_SERVER_PERSONAL_ACCESS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: ADO_SERVER_PERSONAL_ACCESS_TOKEN
            - name: ADO_SERVER_CONNECTOR_TYPE
              value: "repos"
            - name: EXTRACTOR_ID
              value: "ado-repos"    # Any unique value; needed when running several ADO instances
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```

#### Kubernetes deployment YAML template (ADO Server - Boards)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-ado-boards
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-ado-boards
  template:
    metadata:
      labels:
        app: dx-extractor-ado-boards
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "ado_server"
            - name: ADO_SERVER_BASE_URL
              value: "https://devops.mycompany.com"
            - name: ADO_SERVER_ORGANIZATION_NAME
              value: "DefaultCollection"
            - name: ADO_SERVER_PERSONAL_ACCESS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: ADO_SERVER_PERSONAL_ACCESS_TOKEN
            - name: ADO_SERVER_CONNECTOR_TYPE
              value: "boards"
            - name: EXTRACTOR_ID
              value: "ado-boards"    # Any unique value; needed when running several ADO instances
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```

#### Kubernetes deployment YAML template (ADO Server - Pipelines)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-ado-pipelines
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-ado-pipelines
  template:
    metadata:
      labels:
        app: dx-extractor-ado-pipelines
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:latest
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "ado_server"
            - name: ADO_SERVER_BASE_URL
              value: "https://devops.mycompany.com"
            - name: ADO_SERVER_ORGANIZATION_NAME
              value: "DefaultCollection"
            - name: ADO_SERVER_PERSONAL_ACCESS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dx-secrets
                  key: ADO_SERVER_PERSONAL_ACCESS_TOKEN
            - name: ADO_SERVER_CONNECTOR_TYPE
              value: "pipelines"
            - name: EXTRACTOR_ID
              value: "ado-pipelines"    # Any unique value; needed when running several ADO instances
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```

The REST API version supported by your server depends on which version of ADO Server you have installed:

| ADO Server Version | Maximum API Version |
| ------------------ | ------------------- |
| 2019               | 5.0                 |
| 2020               | 5.1                 |
| 2022+              | 6.0                 |

DX defaults to API version `6.0`.

If your server only supports an older version, set `ADO_SERVER_API_VERSION` to match your server's maximum supported version (e.g. `5.0` for ADO Server 2019).
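
As a short sketch, overriding the API version for an ADO Server 2019 deployment means adding one more entry to the container `env` list in any of the ADO templates. The value is quoted so that YAML treats it as a string rather than a number.

```yaml
# Additional entry for the dx-extractor container's env list
- name: ADO_SERVER_API_VERSION
  value: "5.0"   # maximum API version for ADO Server 2019, per the table above
```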

### Perforce (Helix Core)

#### Setup instructions

1. Create a dedicated Perforce service account for DX extraction.
2. Ensure the account can access the depots and changelists you want DX to import.
3. If you want review metadata from Helix Swarm, create a Swarm account with API access.
4. Add the required environment variables below and deploy one extractor instance with `EXTRACTION_TYPE=perforce_extractor`.
5. Verify startup logs show successful Perforce connection checks before relying on scheduled syncs.


> Swarm settings are optional. If `SWARM_URL` is not set, DX imports Perforce depots, users, groups, and changelists but skips Swarm reviews.



> If you run multiple Perforce extractor instances for different Perforce/Swarm environments, set a unique `EXTRACTOR_ID` per instance.



> DX Perforce review extraction uses Helix Swarm REST API **v11** endpoints. Use Helix Swarm version **2022.1 or newer** to ensure API compatibility.



> The recent-review sync scans Swarm reviews in the states `approved`, `approved:commit`, `rejected`, and `archived` that were created in the last **45 days**. Because Swarm does not provide the updated-time filter DX needs here, late changes to older reviews can be missed once those reviews fall outside the 45-day window.


#### Required and optional environment variables

| Name                 | Description                                                                                                                                  |
| :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------- |
| EXTRACTION_TYPE      | Must be set to **perforce_extractor**.<br><br>Example:<br>`perforce_extractor`                                                               |
| DATACLOUD_URL        | Your Data Cloud instance URL.<br><br>Example:<br>`https://yourinstance.getdx.net`                                                            |
| DATACLOUD_KEY        | Data Cloud API key.<br><br>Example:<br>`mPB5sf6w3JahSLMherWp8B7nTps13FKY`                                                                    |
| PERFORCE_PORT        | Perforce server address and port (supports SSL endpoints).<br><br>Example:<br>`ssl:perforce.company.com:1666`                                |
| PERFORCE_USERNAME    | Perforce username for the service account.<br><br>Example:<br>`dx-extractor`                                                                 |
| PERFORCE_PASSWORD    | Perforce password or ticket secret for the service account.                                                                                  |
| SWARM_URL            | Optional. Base URL of your Helix Swarm instance.<br><br>Example:<br>`https://swarm.company.com`                                              |
| SWARM_USERNAME       | Optional. Swarm username (required when `SWARM_URL` is set).<br><br>Example:<br>`dx-swarm`                                                   |
| SWARM_PASSWORD       | Optional. Swarm password or token (required when `SWARM_URL` is set).                                                                        |
| EXTRACTOR_ID         | Optional. Unique ID for this extractor instance. Recommended when running multiple Perforce extractors.<br><br>Example:<br>`perforce-prod-1` |
| EXTRACTOR_PROXY_URL  | Optional. Proxy URL to forward requests to Data Cloud.<br><br>Example:<br>`proxy.getdx.net`                                                  |
| EXTRACTOR_PROXY_PORT | Optional. Proxy port.<br><br>Example:<br>`80`                                                                                                |
| EXTRACTOR_PROXY_USER | Optional. Proxy username.<br><br>Example:<br>`dxuser`                                                                                        |
| EXTRACTOR_PROXY_PASS | Optional. Proxy password.                                                                                                                    |
| SLEEP_DURATION       | Optional. Polling interval in seconds between extraction cycles.<br><br>Example:<br>`300`                                                    |
| LOG_LEVEL            | Optional. Log verbosity (`debug`, `info`, etc.).<br><br>Example:<br>`debug`                                                                  |
| LOG_FORMAT           | Optional. Log format (`json` or `text`).<br><br>Example:<br>`json`                                                                           |

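The optional proxy and polling variables in the table do not appear in the templates that follow. As a sketch, they could be added to the `environment` block of a Compose service like this. The values are the table's own examples plus a placeholder password, and whether your proxy requires credentials depends on your network.

```yaml
services:
  extractor:
    environment:
      # Optional proxy settings for traffic to Data Cloud (example values from the table)
      EXTRACTOR_PROXY_URL: "proxy.getdx.net"
      EXTRACTOR_PROXY_PORT: "80"
      EXTRACTOR_PROXY_USER: "dxuser"
      EXTRACTOR_PROXY_PASS: "your-proxy-password" # Placeholder
      # Optional polling interval, in seconds, between extraction cycles
      SLEEP_DURATION: "300"
```
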
#### Docker Compose template (Perforce)

```yaml
services:
  extractor:
    image: ghcr.io/get-dx/extractor:perforce
    environment:
      DATACLOUD_URL: "https://yourinstance.getdx.net"
      DATACLOUD_KEY: "your-datacloud-api-key"
      EXTRACTION_TYPE: "perforce_extractor"
      PERFORCE_PORT: "ssl:perforce.company.com:1666"
      PERFORCE_USERNAME: "dx-extractor"
      PERFORCE_PASSWORD: "your-perforce-password-or-ticket"
      SWARM_URL: "https://swarm.company.com" # Optional
      SWARM_USERNAME: "dx-swarm" # Required when SWARM_URL is set
      SWARM_PASSWORD: "your-swarm-password-or-token" # Required when SWARM_URL is set
      EXTRACTOR_ID: "perforce-prod-1" # Recommended if multiple Perforce extractors are deployed
      LOG_LEVEL: "debug"
      LOG_FORMAT: "json"
    restart: always
```
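
As a sketch of the multi-instance and Swarm-optional cases described in the notes above, the following Compose file runs two Perforce extractors side by side. The service names, the second server's hostname, and both `EXTRACTOR_ID` values are placeholders, not values DX prescribes. The second instance omits the `SWARM_*` variables, so it would import Perforce data without Swarm reviews.

```yaml
services:
  extractor-perforce-prod:
    image: ghcr.io/get-dx/extractor:perforce
    environment:
      DATACLOUD_URL: "https://yourinstance.getdx.net"
      DATACLOUD_KEY: "your-datacloud-api-key"
      EXTRACTION_TYPE: "perforce_extractor"
      PERFORCE_PORT: "ssl:perforce.company.com:1666"
      PERFORCE_USERNAME: "dx-extractor"
      PERFORCE_PASSWORD: "your-perforce-password-or-ticket"
      SWARM_URL: "https://swarm.company.com"
      SWARM_USERNAME: "dx-swarm"
      SWARM_PASSWORD: "your-swarm-password-or-token"
      EXTRACTOR_ID: "perforce-prod-1" # Must be unique across Perforce extractors
    restart: always

  extractor-perforce-staging:
    image: ghcr.io/get-dx/extractor:perforce
    environment:
      DATACLOUD_URL: "https://yourinstance.getdx.net"
      DATACLOUD_KEY: "your-datacloud-api-key"
      EXTRACTION_TYPE: "perforce_extractor"
      PERFORCE_PORT: "ssl:perforce-staging.company.com:1666" # Hypothetical second server
      PERFORCE_USERNAME: "dx-extractor"
      PERFORCE_PASSWORD: "your-perforce-password-or-ticket"
      # No SWARM_* variables: Swarm reviews are skipped for this instance
      EXTRACTOR_ID: "perforce-staging-1" # Must be unique across Perforce extractors
    restart: always
```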

Use the Perforce-specific image tags for Perforce extraction:

- `ghcr.io/get-dx/extractor:perforce` (GLIBC/slim)
- `ghcr.io/get-dx/extractor:perforce-alpine` (Alpine)

Versioned tags are also available:

- `ghcr.io/get-dx/extractor:<version>-perforce`
- `ghcr.io/get-dx/extractor:<version>-perforce-alpine`

If you build the extractor image from source instead of pulling these GHCR tags, include the Perforce SDK during build:

```yaml
services:
  extractor:
    build:
      context: .
      args:
        INCLUDE_P4: "true"
```

#### Kubernetes deployment YAML template (Perforce)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dx-extractor-perforce
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dx-extractor-perforce
  template:
    metadata:
      labels:
        app: dx-extractor-perforce
    spec:
      containers:
        - name: dx-extractor
          image: ghcr.io/get-dx/extractor:perforce
          env:
            - name: DATACLOUD_URL
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: DATACLOUD_URL
            - name: DATACLOUD_KEY
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: DATACLOUD_KEY
            - name: EXTRACTION_TYPE
              value: "perforce_extractor"
            - name: PERFORCE_PORT
              value: "ssl:perforce.company.com:1666"
            - name: PERFORCE_USERNAME
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: PERFORCE_USERNAME
            - name: PERFORCE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: PERFORCE_PASSWORD
            - name: SWARM_URL
              value: "https://swarm.company.com" # Optional
            - name: SWARM_USERNAME
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: SWARM_USERNAME
            - name: SWARM_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: perforce-connector-secrets
                  key: SWARM_PASSWORD
            - name: EXTRACTOR_ID
              value: "perforce-prod-1"
            - name: LOG_LEVEL
              value: "debug"
            - name: LOG_FORMAT
              value: "json"
      restartPolicy: Always
```
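
The Deployment above reads several values from a Secret named `perforce-connector-secrets`. A minimal sketch of that Secret follows, assuming it is created in the same namespace as the Deployment. The key names match the `secretKeyRef` entries in the template, and every value is a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: perforce-connector-secrets
type: Opaque
# stringData accepts plain-text values; Kubernetes stores them base64-encoded
stringData:
  DATACLOUD_URL: "https://yourinstance.getdx.net"
  DATACLOUD_KEY: "your-datacloud-api-key"
  PERFORCE_USERNAME: "dx-extractor"
  PERFORCE_PASSWORD: "your-perforce-password-or-ticket"
  SWARM_USERNAME: "dx-swarm"
  SWARM_PASSWORD: "your-swarm-password-or-token"
```

If you do not use Swarm, the two `SWARM_*` keys can be left out of the Secret, provided the matching `SWARM_URL`, `SWARM_USERNAME`, and `SWARM_PASSWORD` entries are also removed from the Deployment. You may prefer to create the Secret through your own secret-management tooling rather than committing a manifest that contains credentials.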

#### P4Ruby SDK methods used

| Wrapper method              | p4ruby call                                                     | Purpose                                                                                                          |
| --------------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| `list_depots`               | `p4.run_depots`                                                 | All depots                                                                                                       |
| `list_users`                | `p4.run_users("-a")`                                            | All users (incl. inactive)                                                                                       |
| `list_groups`               | `p4.run_groups`                                                 | All groups                                                                                                       |
| `list_user_groups(user)`    | `p4.run_groups("-u", user)`                                     | Group memberships for a user                                                                                     |
| `list_changes(...)`         | `p4.run_changes("-s", "submitted", "-m", max, "PATH@from,@to")` | Submitted changelists, newest first, by depot path + date range                                                  |
| `describe_change_stats(cl)` | `p4.run_describe("-ds", cl)` (with `p4.tagged = false`)         | Diff summary (added/deleted/edited lines) for one changelist                                                     |
| `server_info`               | `p4.run("info")`                                                | Connection verification                                                                                          |
| connection lifecycle        | `P4.new`, `p4.connect`, `p4.run_login`, `p4.disconnect`         | Connects, logs in, and disconnects for each instruction. The ticket file is stored at `/tmp/.p4tickets`, and `exception_level` is set to `P4::RAISE_ERRORS`. |

#### Swarm REST API endpoints used

| Wrapper method           | Endpoint                             | Notes                                                 |
| ------------------------ | ------------------------------------ | ----------------------------------------------------- |
| `list_reviews`           | `GET /api/v11/reviews`               | Params: `max`, `after` (pagination cursor), `state[]` |
| `get_review`             | `GET /api/v11/reviews/{id}`          | Single review                                         |
| `list_review_activities` | `GET /api/v11/reviews/{id}/activity` | Params: `max`, `after`                                |
| `list_review_comments`   | `GET /api/v11/reviews/{id}/comments` | —                                                     |
| `server_info`            | `GET /api/v11/version`               | Connection verification                               |
