ADE Generation Script
This script extracts information from the names of the files in a specified FastQ folder to build an ADE file for creating runs. It also uses your credentials with sg-upload-v2-latest.jar to run the following subcommands:
- userInfo
- patient --create
- patient --list
- pipeline --list
Usage
python3 adegen.py [-h] [-j JAR] [-o OUTPUT] [-r REF] [-p PIPELINE] [-s SAMPLETYPE] [-c] folder
Required Arguments
folder
the target folder containing FastQ files
Optional Arguments
-j JAR, --jar JAR
Full path of sg-upload-v2-latest.jar (looks in current directory if not given)
-o OUTPUT, --output OUTPUT
Output JSON file (overwritten without warning). If not given, the ADE is written to stdout
-r REF, --ref REF
A name for the userRef attribute of the ADE file. If not given, one is generated from the folder name and the current date/time.
-p PIPELINE, --pipeline PIPELINE
ID of pipeline. This will retrieve the sequencer ID automatically. If not given, you will be asked to select from a list.
-s SAMPLETYPE, --sampletype SAMPLETYPE
sampleTypeId to apply to all samples - defaults to 108000 (Peripheral Blood)
-c, --confirm
Confirm use of script. If not used, you must confirm manually.
-y, --yaml
An application override file.
-fp, --forceplatform
Force platform services
--bdsMappingFile
*Path to the mapping file containing mapping of patient references to Serial Numbers
--bdsNumber
*Mandatory Serial Number for all SOPHiA GENETICS bundle solutions
Prerequisites
- Python 3
- sg-upload-v2-latest.jar
- Ensure there is no identifiable information in the file names
Naming Convention
The script relies on the following naming convention for files:
patient_ref<-X>_Sxxx_Lxxx_Rxxx_xxx.fastq.gz
where X is optional and is one of D (DNA), R (RNA), N (Normal), T (Tumor). For example:
pat-1_S1_L001_R1_001.fastq.gz
or
pat-1-D_S1_L001_R1_001.fastq.gz
The part up to -X or _Sxxx (pat-1 in this example) should be a maximum of 30 characters, and must not contain identifiable information as it will be used as a patient reference.
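The convention above can be sketched as a regular expression. This is only an illustration of the documented naming rule, not the parser that adegen.py actually uses; the names `FASTQ_NAME`, `FastqName` and `parse_fastq_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for patient_ref<-X>_Sxxx_Lxxx_Rxxx_xxx.fastq.gz;
# the real adegen.py may parse names differently.
FASTQ_NAME = re.compile(
    r"(?P<ref>.+?)(?:-(?P<label>[DRNT]))?"
    r"_(?P<sample>S\d+)_(?P<lane>L\d+)_(?P<read>R\d+)_\d+\.fastq\.gz"
)

MAX_REF_LENGTH = 30  # limit stated in the naming convention


class FastqName(NamedTuple):
    patient_ref: str
    label: Optional[str]  # "D", "R", "N", "T", or None when no label is present
    sample_id: str
    lane: str
    read: str


def parse_fastq_name(filename: str) -> FastqName:
    """Split a FastQ file name into the parts the convention describes."""
    match = FASTQ_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised FastQ name: {filename}")
    if len(match["ref"]) > MAX_REF_LENGTH:
        raise ValueError(
            f"patient reference longer than {MAX_REF_LENGTH} characters: {filename}"
        )
    return FastqName(
        match["ref"], match["label"], match["sample"], match["lane"], match["read"]
    )


print(parse_fastq_name("pat-1-D_S1_L001_R1_001.fastq.gz"))
```

Note that a patient reference that itself ends in -D, -R, -N or -T would be read as a label by this sketch.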
Analyses
Each sample ID will map to one analysis in the ADE and may contain multiple files.
Topologies
There are two topologies: mys and tumorNormal. A topology is created when a patient has a matching pair of D/R files (mys) or N/T files (tumorNormal). For example:
| mys | tumorNormal |
|---|---|
| pat-1-D_S1_L001_R1_001.fastq.gz | pat-1-N_S1_L001_R1_001.fastq.gz |
| pat-1-D_S1_L001_R2_001.fastq.gz | pat-1-N_S1_L001_R2_001.fastq.gz |
| pat-1-R_S2_L001_R1_001.fastq.gz | pat-1-T_S2_L001_R1_001.fastq.gz |
| pat-1-R_S2_L001_R2_001.fastq.gz | pat-1-T_S2_L001_R2_001.fastq.gz |
Files which have unmatched D/R or N/T labels will still have an analysis entry, but no topology.
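The pairing rule can be sketched as follows, assuming each analysis has already been reduced to a (patient reference, label, sample ID) tuple. The helper `pair_topologies` and the `PAIRINGS` table are hypothetical; only the topology types and role names come from this document.

```python
from collections import defaultdict

# Label pairs that form a topology, with the role each label plays.
# The type and role names are taken from the JSON examples in this document.
PAIRINGS = {
    "mys": {"D": "dna", "R": "rna"},
    "tumorNormal": {"N": "normal", "T": "tumor"},
}


def pair_topologies(samples):
    """Return one topology per patient that has a complete D/R or N/T pair.

    samples: iterable of (patient_ref, label, sample_id); label may be None.
    """
    by_patient = defaultdict(dict)
    for patient_ref, label, sample_id in samples:
        if label is not None:
            by_patient[patient_ref][label] = sample_id

    topologies = []
    for labels in by_patient.values():
        for topology_type, roles in PAIRINGS.items():
            # Unmatched labels are skipped: the analysis exists, but has no topology.
            if all(label in labels for label in roles):
                topologies.append({
                    "type": topology_type,
                    "roles": {role: labels[label] for label, role in roles.items()},
                })
    return topologies


samples = [("pat-1", "D", "S1"), ("pat-1", "R", "S2"), ("pat-2", "N", "S3")]
print(pair_topologies(samples))
```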
Example Usages
Minimum
python3 adegen.py ~/fastq/study001/
All Arguments
python3 adegen.py ~/fastq/study001/ -j ~/sg-upload-v2-latest.jar -o ade.json -r Study001 -p 159 -s 708000 -c
NOTE
- The same sampleTypeId and pipeline (and retrieved sequencer) will apply to all entries
- Make sure you are already logged in with your credentials using the uploader tool
ADE Format Description
This documentation gives a technical description of the ADE format, as well as an example of how to build this JSON for a DNA/RNA analysis case. The resulting JSON can be used as input to the uploader tool.
Schema
{
"protocolName": "ADE",
"protocolVersion": "1",
"client": {
"id": SOPHiA DDM global client ID,
"userId": User technical ID (internal or external) for the request
},
"request": {
"definition": {
"userRef": Name of the request (defined by the client),
"sequencerId": Technical ID of the sequencing tool used for this request,
"requestDate": Date of the request. Formatted as epoch seconds (UNIX timestamp),
"isPairedEnd": Whether the sequencing tool creates pairs of files per sample,
"isPrevent": Optional, whether this request is linked to the SOPHiA PREVENT product. Defaults to false
},
"analyses": [
{
"definition": {
"sampleId": Unique sample ID within the request. #multiplexId can be used,
"multiplexId": Sequencer multiplex ID,
"sgaPipelineId": Technical pipeline ID,
"userRef": Name of the sample (defined by the user). The patient ref can be used,
"sampleTypeId": Sample type definition ID (blood, FFPE...); see the sampleTypeId list in the Building ADE JSON section,
"sisNumber": Optional unique SIS purchase order number. Only required for BDS-activated pipelines,
"libraryType": Optional library type (dna / rna). Defaults to dna,
"bdsNumber": Mandatory Serial Number for all SOPHiA GENETICS bundle solutions. Can be found on the sticker on the side of box 1 of the bundle solution kit.
},
"patient": {
"personalInformationId": Personal information ID, from the SGP service,
"medicalInformationId": Medical information ID, from the SGA service
},
"restriction": Optional gene panel, in accordance with the consent of the patient. IMPORTANT: Required for somatic analyses,
"genePanel": { // Optional gene panel tag for the analysis
"id": Technical gene panel ID,
"version": Gene panel version,
"regionsHash": hash of the region names of the gene panel
},
"isControlSample": Optional flag for control samples. Defaults to false,
"files": [ // Files attached to the analysis. Can be empty in the context of remotely referenced samples (see request#topology)
{
"definition": {
"name": Original file name, or absolute path
},
"controlKey": Optional original control key, usually a md5 sum of the compressed file
},
{
"definition": {
"name": Original file name, or absolute path
},
"controlKey": Optional original control key, usually a md5 sum of the compressed file
}
]
}
],
"topology": [
{
"definition": {
"type": Type of the source. e.g.: replicate, tumorNormal, mys
},
"references": [
{
"analysisReference": {
"sampleId": Unique sample ID for a given request
},
"requestReference": { // Optional reference to a request, used to refer to analyses from existing requests,
"owningClientId": Optional client ID. To be used if the owner of the referred request differs from the current request owner,
"requestId": Referred request ID
},
"role": "dna",
"metadata": {}
},
{
"analysisReference": {
"sampleId": Unique sample ID for a given request
},
"requestReference": null, // Optional reference to a request, used to refer to analyses from existing requests,
"role": Role of the analysis in the context of this reference,
"metadata": {} // Context-specific metadata attached to this reference
}
]
}
],
"files": [] // Additional files attached to the request
}
}
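Two fields of the request definition can be computed rather than typed by hand: requestDate (epoch seconds) and a default userRef built from the folder name and the current date/time. The sketch below is an assumption about how that could be done. The `folder_YYYYMMDDHHMM` layout is only inferred from the userRef in the MSK ACCESS example, and both function names are hypothetical.

```python
import time
from datetime import datetime
from pathlib import Path


def default_user_ref(folder, now=None):
    """Build a userRef from the folder name and a timestamp (assumed layout)."""
    now = now or datetime.now()
    return f"{Path(folder).name}_{now:%Y%m%d%H%M}"


def request_date(now=None):
    """Return the request date as whole epoch seconds (UNIX timestamp)."""
    return int(time.time()) if now is None else int(now.timestamp())


print(default_user_ref("~/fastq/study001/"))
print(request_date())
```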
Building ADE JSON
Pre-requisites
- Download the uploader tool from the provided documentation
- Login using the uploader tool with your account
- Go to https://www.unixtimestamp.com/ to retrieve the "Current Unix timestamp (in seconds)"; this will be used to fill the requestDate field.
- Get your clientId and userId using the userInfo command:
userId and clientId
$ python3 sg-upload-v2-wrapper.py userInfo
{
"userId": 405,
"loginUsername": "dnoble",
"clientId": 12
}
- Get the pipeline_id and sequencer_id you want to apply to the samples (please see the provided documentation).
pipeline_id and sequencer_id
$ python3 sg-upload-v2-wrapper.py pipeline --list
[
{
"pipeline_id": 123,
"pipeline_name": "Pipeline 123",
"analysis_type": "BRCA",
"kit": "Multiplicom_MASTR_assay",
"sequencer_id": 123456,
"sequencer": "ILLUMINA_MiSeq",
"experiment_type": "germline",
"paired": true
},
{
"pipeline_id": 456,
"pipeline_name": "Pipeline 456",
"analysis_type": "HCS_v1_1",
"kit": "IDT",
"sequencer_id": 123456,
"sequencer": "ILLUMINA_MiSeq",
"experiment_type": "germline",
"paired": true
}
]
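Once the pipeline list is available as JSON, the sequencer_id for a chosen pipeline can be looked up programmatically. The following is a minimal sketch that assumes the output has the shape shown above; `find_sequencer_id` is a hypothetical helper, not part of the uploader tool.

```python
import json


def find_sequencer_id(pipeline_list_json: str, pipeline_id: int) -> int:
    """Return the sequencer_id of the pipeline with the given pipeline_id."""
    for pipeline in json.loads(pipeline_list_json):
        if pipeline["pipeline_id"] == pipeline_id:
            return pipeline["sequencer_id"]
    raise KeyError(f"pipeline {pipeline_id} not found in pipeline list")


# Trimmed-down stand-in for the output of `pipeline --list`.
listing = """
[
  {"pipeline_id": 123, "sequencer_id": 123456, "paired": true},
  {"pipeline_id": 456, "sequencer_id": 123456, "paired": true}
]
"""
print(find_sequencer_id(listing, 456))
```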
On the left are files from a DNA/RNA case (note -D and -R), and on the right is a tumorNormal case (note -N and -T). The part up to -D/-R/-N/-T is the patient ref and should be a maximum of 30 characters.
| DNA/RNA | tumorNormal |
|---|---|
| pat1-D_S1_L001_R1_001.fastq.gz | pat1-N_S1_L001_R1_001.fastq.gz |
| pat1-D_S1_L001_R2_001.fastq.gz | pat1-N_S1_L001_R2_001.fastq.gz |
| pat1-R_S1_L001_R1_001.fastq.gz | pat1-T_S1_L001_R1_001.fastq.gz |
| pat1-R_S1_L001_R2_001.fastq.gz | pat1-T_S1_L001_R2_001.fastq.gz |
Create this patient using the command line (see the provided documentation):
$ python3 sg-upload-v2-wrapper.py userInfo patient --create --patient-ref pat1
Remaining patient(s) to create: pat1
Have saved 1 patient(s)
Then retrieve personalInformationId and medicalInformationId.
medicalInformationId and personalInformationId
$ python3 sg-upload-v2-wrapper.py userInfo patient --list --patient-ref pat1
Should display:
[
{
"medicalInformationId": 111111111,
"personalInformationId": 222222222,
"userRef": "pat1"
}
]
For each sample, build the JSON chunk of the analysis using the information gathered above (note that pipeline_id = sgaPipelineId). The libraryType field depends on the sample label (-D, -R, -N or -T).
Example for DNA analysis
{
"definition": {
"sampleId": "S1",
"multiplexId": "S1",
"sgaPipelineId": 123,
"userRef": "pat1-D", // patient ref followed by -D/-R/-N/-T
"sampleTypeId": 308000,
"libraryType": "dna" // -D = dna, -R = rna, -N = dna, -T = dna
},
"patient": {
"personalInformationId": 222222222,
"medicalInformationId": 111111111
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz"
}
}
]
}
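The same chunk can be assembled in code. The sketch below mirrors the example above, with the label-to-libraryType mapping taken from its comment; the function `build_analysis` and its parameter names are hypothetical.

```python
# Label to libraryType, as in the comment of the example above.
LIBRARY_TYPES = {"D": "dna", "R": "rna", "N": "dna", "T": "dna"}


def build_analysis(sample_id, patient_ref, label, pipeline_id, sample_type_id,
                   personal_information_id, medical_information_id, file_paths):
    """Build one entry of the "analyses" array as a plain dict."""
    user_ref = f"{patient_ref}-{label}" if label else patient_ref
    return {
        "definition": {
            "sampleId": sample_id,
            "multiplexId": sample_id,
            "sgaPipelineId": pipeline_id,  # pipeline_id from `pipeline --list`
            "userRef": user_ref,
            "sampleTypeId": sample_type_id,
            # libraryType defaults to dna when there is no label.
            "libraryType": LIBRARY_TYPES.get(label, "dna"),
        },
        "patient": {
            "personalInformationId": personal_information_id,
            "medicalInformationId": medical_information_id,
        },
        "isControlSample": False,
        # Sorting keeps R1 before R2 for a given sample.
        "files": [{"definition": {"name": path}} for path in sorted(file_paths)],
    }


analysis = build_analysis(
    "S1", "pat1", "D", 123, 308000, 222222222, 111111111,
    ["/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz",
     "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"],
)
print(analysis["definition"])
```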
sampleTypeId should be one of:
- OTHER = 8000
- PERIPHERAL_BLOOD = 108000
- FRESH_TUMOR = 208000
- FFPE = 308000
- BIOPSY = 408000
- CELL_LINE = 508000
- CTDNA = 608000
- BUCCAL_SWAB = 708000
- NASOPHARYNGEAL_SWAB = 808000
- SPUTUM = 908000
- BRONCHOALVEOLAR_LAVAGE = 1008000
- SALIVA = 1108000
- BONE_MARROW = 1208000
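For scripts that build ADE files, the list above can be kept as an enumeration so that sample types are referred to by name. This is a convenience sketch; the class name `SampleType` is an assumption, while the names and values are copied from the list.

```python
from enum import IntEnum


class SampleType(IntEnum):
    """sampleTypeId values listed in this document."""
    OTHER = 8000
    PERIPHERAL_BLOOD = 108000
    FRESH_TUMOR = 208000
    FFPE = 308000
    BIOPSY = 408000
    CELL_LINE = 508000
    CTDNA = 608000
    BUCCAL_SWAB = 708000
    NASOPHARYNGEAL_SWAB = 808000
    SPUTUM = 908000
    BRONCHOALVEOLAR_LAVAGE = 1008000
    SALIVA = 1108000
    BONE_MARROW = 1208000


# IntEnum members serialise as plain integers with the json module.
print(SampleType.FFPE.name, int(SampleType.FFPE))
```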
To link DNA with RNA or Normal with Tumor analyses in our system, it is necessary to generate a "topology" JSON chunk like below (comments have to be removed from the built JSON):
DNA/RNA Topology
{
"definition": {
"type": "mys"
},
"references": [
{
"analysisReference": {
"sampleId": "S1"
},
"role": "dna",
"metadata": {}
},
{
"analysisReference": {
"sampleId": "S2"
},
"role": "rna",
"metadata": {}
}
]
}
tumorNormal Topology
{
"definition": {
"type": "tumorNormal"
},
"references": [
{
"analysisReference": {
"sampleId": "S1"
},
"role": "tumor",
"metadata": {}
},
{
"analysisReference": {
"sampleId": "S2"
},
"role": "normal",
"metadata": {}
}
]
}
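Both topology chunks share one shape, so they can be generated by a single helper. A minimal sketch, assuming the roles are passed as a mapping from role name to sampleId; `build_topology` is a hypothetical name.

```python
def build_topology(topology_type, roles):
    """Build one entry of the "topology" array.

    roles: mapping of role name ("dna", "rna", "normal", "tumor") to sampleId.
    """
    return {
        "definition": {"type": topology_type},
        "references": [
            {
                "analysisReference": {"sampleId": sample_id},
                "role": role,
                "metadata": {},
            }
            for role, sample_id in roles.items()
        ],
    }


print(build_topology("mys", {"dna": "S1", "rna": "S2"}))
```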
Full DNA/RNA Example
{
"protocolName": "ADE",
"protocolVersion": "1",
"client": {
"id": 3,
"userId": 700
},
"request": {
"definition": {
"userRef": "Baseline Super Run™",
"sequencerId": 123456,
"requestDate": 1568975438,
"isPairedEnd": true,
"isPrevent": false
},
"state": null,
"analyses": [
{
"definition": {
"sampleId": "S1",
"multiplexId": "S1",
"sgaPipelineId": 123,
"userRef": "pat1-D",
"sampleTypeId": 308000,
"libraryType": "dna"
},
"patient": {
"personalInformationId": 222222222,
"medicalInformationId": 111111111
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz"
}
}
]
},
{
"definition": {
"sampleId": "S2",
"multiplexId": "S2",
"sgaPipelineId": 123,
"userRef": "pat1-R",
"sampleTypeId": 308000,
"libraryType": "rna"
},
"patient": {
"personalInformationId": 222222222,
"medicalInformationId": 111111111
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/pat1-R_S1_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/pat1-R_S1_L001_R2_001.fastq.gz"
}
}
]
}
],
"topology": [
{
"definition": {
"type": "mys"
},
"references": [
{
"analysisReference": {
"sampleId": "S1"
},
"role": "dna",
"metadata": {}
},
{
"analysisReference": {
"sampleId": "S2"
},
"role": "rna",
"metadata": {}
}
]
}
],
"files": []
}
}
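The envelope around the analyses and topology entries can be assembled and serialised with the standard json module. The sketch below follows the structure of the full example above; `build_ade` and its parameters are hypothetical, and writing to ade.json is only an illustration.

```python
import json


def build_ade(client_id, user_id, user_ref, sequencer_id, request_date,
              analyses, topology, is_paired_end=True):
    """Wrap analyses and topology entries in the top-level ADE structure."""
    return {
        "protocolName": "ADE",
        "protocolVersion": "1",
        "client": {"id": client_id, "userId": user_id},
        "request": {
            "definition": {
                "userRef": user_ref,
                "sequencerId": sequencer_id,
                "requestDate": request_date,  # epoch seconds
                "isPairedEnd": is_paired_end,
                "isPrevent": False,
            },
            "analyses": analyses,
            "topology": topology,
            "files": [],
        },
    }


ade = build_ade(3, 700, "Study001", 123456, 1568975438, analyses=[], topology=[])
text = json.dumps(ade, indent=2, ensure_ascii=False)
print(text)
# To save it: open("ade.json", "w", encoding="utf-8").write(text)
```

Serialising with json.dumps also guarantees that the result contains no comments, which the uploader input must not have.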
Full tumorNormal Example
{
"protocolName": "ADE",
"protocolVersion": "1",
"client": {
"id": 3,
"userId": 700
},
"request": {
"definition": {
"userRef": "Baseline Super Run™",
"sequencerId": 123456,
"requestDate": 1568975438,
"isPairedEnd": true,
"isPrevent": false
},
"analyses": [
{
"definition": {
"sampleId": "S1",
"multiplexId": "S1",
"sgaPipelineId": 123,
"userRef": "pat1-N",
"sampleTypeId": 308000,
"libraryType": "dna"
},
"patient": {
"personalInformationId": 222222222,
"medicalInformationId": 111111111
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/pat1-N_S1_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/pat1-N_S1_L001_R2_001.fastq.gz"
}
}
]
},
{
"definition": {
"sampleId": "S2",
"multiplexId": "S2",
"sgaPipelineId": 123,
"userRef": "pat1-T",
"sampleTypeId": 308000,
"libraryType": "dna"
},
"patient": {
"personalInformationId": 222222222,
"medicalInformationId": 111111111
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/pat1-T_S1_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/pat1-T_S1_L001_R2_001.fastq.gz"
}
}
]
}
],
"topology": [
{
"definition": {
"type": "tumorNormal"
},
"references": [
{
"analysisReference": {
"sampleId": "S1"
},
"role": "normal",
"metadata": {}
},
{
"analysisReference": {
"sampleId": "S2"
},
"role": "tumor",
"metadata": {}
}
]
}
],
"files": []
}
}
MSK ACCESS example
{
"protocolName": "ADE",
"protocolVersion": "1",
"client": {
"id": 3,
"userId": 66238
},
"request": {
"definition": {
"userRef": "MSKFASTQ2_202402131741",
"sequencerId": 1606000,
"requestDate": 1707842498,
"isPairedEnd": true,
"isPrevent": false
},
"state": null,
"analyses": [
{
"definition": {
"sampleId": "S01",
"multiplexId": "S01",
"sgaPipelineId": 7043,
"userRef": "ACCESS-1",
"sampleTypeId": 1408000,
"libraryType": ""
},
"patient": {
"personalInformationId": 200038697,
"medicalInformationId": 200058694
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/ACCESS-1-N_S01_L001_R2_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/ACCESS-1-N_S01_L001_R1_001.fastq.gz"
}
}
]
},
{
"definition": {
"sampleId": "S02",
"multiplexId": "S02",
"sgaPipelineId": 7043,
"userRef": "ACCESS-1",
"sampleTypeId": 1308000,
"libraryType": ""
},
"patient": {
"personalInformationId": 200038697,
"medicalInformationId": 200058694
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/ACCESS-1-T_S02_L001_R2_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/ACCESS-1-T_S02_L001_R1_001.fastq.gz"
}
}
]
},
{
"definition": {
"sampleId": "S03",
"multiplexId": "S03",
"sgaPipelineId": 7043,
"userRef": "ACCESS-2",
"sampleTypeId": 1308000,
"libraryType": ""
},
"patient": {
"personalInformationId": 281900929,
"medicalInformationId": 338306461
},
"isControlSample": false,
"files": [
{
"definition": {
"name": "/path/to/input/files/ACCESS-2-T_S03_L001_R1_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/ACCESS-2-T_S03_L001_R2_001.fastq.gz"
}
}
]
},
{
"definition": {
"sampleId": "S04",
"multiplexId": "S04",
"sgaPipelineId": 7043,
"userRef": "ACCESS-3",
"sampleTypeId": 1508000,
"libraryType": ""
},
"patient": {
"personalInformationId": 200038701,
"medicalInformationId": 200058698
},
"isControlSample": true,
"files": [
{
"definition": {
"name": "/path/to/input/files/ACCESS-3-CP_S04_L001_R2_001.fastq.gz"
}
},
{
"definition": {
"name": "/path/to/input/files/ACCESS-3-CP_S04_L001_R1_001.fastq.gz"
}
}
]
}
],
"topology": [
{
"definition": {
"type": "tumorNormal"
},
"references": [
{
"analysisReference": {
"sampleId": "S01"
},
"role": "normal",
"metadata": {}
},
{
"analysisReference": {
"sampleId": "S02"
},
"role": "tumor",
"metadata": {}
}
]
}
],
"files": []
}
}
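Before handing a hand-built ADE file to the uploader, it can help to check that it is internally consistent. The sketch below checks two properties that follow from the schema description: sampleId values are unique within the request, and topology references without a requestReference point at a sampleId that exists. It is not the validation performed by the uploader, and `check_ade` is a hypothetical name.

```python
from collections import Counter


def check_ade(ade):
    """Return a list of human-readable problems found in an ADE dict."""
    problems = []
    request = ade["request"]

    sample_ids = [a["definition"]["sampleId"] for a in request.get("analyses", [])]
    for sample_id, count in sorted(Counter(sample_ids).items()):
        if count > 1:
            problems.append(f"duplicate sampleId: {sample_id}")

    known = set(sample_ids)
    for topology in request.get("topology", []):
        for reference in topology["references"]:
            if reference.get("requestReference"):
                continue  # refers to an analysis in another, existing request
            sample_id = reference["analysisReference"]["sampleId"]
            if sample_id not in known:
                problems.append(f"topology references unknown sampleId: {sample_id}")
    return problems


example = {
    "request": {
        "analyses": [{"definition": {"sampleId": "S01"}},
                     {"definition": {"sampleId": "S02"}}],
        "topology": [{
            "definition": {"type": "tumorNormal"},
            "references": [
                {"analysisReference": {"sampleId": "S01"}, "role": "normal"},
                {"analysisReference": {"sampleId": "S05"}, "role": "tumor"},
            ],
        }],
    }
}
print(check_ade(example))
```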
Serial numbers
Using bdsNumber and bdsMappingFile Flags
The script supports two flags for specifying Serial numbers, which are unique identifiers required for all SOPHiA GENETICS bundle solutions. These are --bdsNumber for a single Serial number applicable to all samples, and --bdsMappingFile for a file containing specific Serial numbers mapped to patient references.
--bdsNumber Flag
- This flag allows you to specify a single Serial number that will be applied to all samples processed by the script. It's useful when all samples in the dataset can be associated with the same Serial number.
- The provided Serial number must follow the format BDS-XXXXXXXXXX-XX, where XXXXXXXXXX is a sequence of 10 digits and XX is a two-digit number that equals the sum of the previous 10 digits.
- Example usage:
--bdsNumber BDS-0000020695-22
--bdsMappingFile Flag
- This flag allows you to specify a file containing mappings from patient references to specific Serial numbers. It's useful when different samples or patients in the dataset require different Serial numbers.
- The file must be formatted as CSV (comma-separated values), with each line containing a patient reference followed by a comma and the corresponding Serial number. Each Serial number must follow the format BDS-XXXXXXXXXX-XX.
- The file format should look like this:
patient_ref1, BDS-0000020695-22
patient_ref2, BDS-0000030695-23
...
- Example usage:
--bdsMappingFile path/to/bds_mapping_file.csv
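A mapping file in this layout can be read with the standard csv module. The sketch below is an assumption about one reasonable way to load it, not the script's own reader; `read_bds_mapping` is a hypothetical name, and the second serial number is made up for illustration.

```python
import csv


def read_bds_mapping(lines):
    """Read "patient_ref, serial" rows into a dict.

    lines: any iterable of text lines, such as an open file.
    """
    mapping = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        patient_ref, serial = (field.strip() for field in row)
        mapping[patient_ref] = serial
    return mapping


sample_lines = [
    "patient_ref1, BDS-0000020695-22\n",
    "\n",
    "patient_ref2, BDS-0000000001-01\n",  # made-up serial number
]
print(read_bds_mapping(sample_lines))
```

With a real file, the call would look like `read_bds_mapping(open(path, newline=""))`, ideally inside a `with` block.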
Validating Serial Numbers
Both the --bdsNumber and --bdsMappingFile flags will trigger validation of the provided Serial numbers to ensure they conform to the required format and checksum. If an invalid Serial number is encountered, the script will terminate with an error message.
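The format and checksum rules described in this section can be expressed in a few lines. The following is a sketch of that check, not the script's actual code; both function names and the wording of the error message are assumptions.

```python
import re

# BDS-XXXXXXXXXX-XX: ten digits, then a two-digit checksum.
BDS_FORMAT = re.compile(r"BDS-([0-9]{10})-([0-9]{2})")


def is_valid_bds_number(serial: str) -> bool:
    """Check the format and that the checksum equals the sum of the 10 digits."""
    match = BDS_FORMAT.fullmatch(serial)
    if match is None:
        return False
    digits, checksum = match.groups()
    return sum(int(digit) for digit in digits) == int(checksum)


def require_valid_bds_number(serial: str) -> str:
    """Return the serial unchanged, or stop the program with an error message."""
    if not is_valid_bds_number(serial):
        raise SystemExit(f"Invalid Serial number: {serial}")
    return serial


print(is_valid_bds_number("BDS-0000020695-22"))
```

For BDS-0000020695-22 the digits sum to 2 + 6 + 9 + 5 = 22, which matches the checksum.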