ADE Generation Script

This script extracts information from the names of the files in a specified FastQ folder to build an ADE file for creating runs. It also runs the following sg-upload-v2-latest.jar subcommands using your credentials:

userInfo
patient --create
patient --list
pipeline --list

Usage

python3 adegen.py [-h] [-j JAR] [-o OUTPUT] [-r REF] [-p PIPELINE] [-s SAMPLETYPE] [-c] folder

Required Arguments

folder

the target folder containing FastQ files

Optional Arguments

-j JAR, --jar JAR

Full path of sg-upload-v2-latest.jar (looks in current directory if not given)

-o OUTPUT, --output OUTPUT

Output JSON file (overwritten without warning). If not given, the ADE is written to stdout.

-r REF, --ref REF

A name for the userRef attribute of the ADE file. If not given, one is generated from the folder name and the current date/time.

-p PIPELINE, --pipeline PIPELINE

ID of pipeline. This will retrieve the sequencer ID automatically. If not given, you will be asked to select from a list.

-s SAMPLETYPE, --sampletype SAMPLETYPE

sampleTypeId to apply to all samples - defaults to 108000 (Peripheral Blood)

-c, --confirm

Confirm use of script. If not used, you must confirm manually.

-y, --yaml

An application override file.

-fp, --forceplatform

Force platform services

--bdsMappingFile

Path to the file that maps patient references to Serial Numbers

--bdsNumber

Mandatory Serial Number for all SOPHiA GENETICS bundle solutions

Prerequisites

Python 3
sg-upload-v2-latest.jar
Ensure there is no identifiable information in the file names

Naming Convention

The script relies on the following naming convention for files:

patient_ref<-X>_Sxxx_Lxxx_Rxxx_xxx.fastq.gz

where X is optional and is one of D (DNA), R (RNA), N (Normal), T (Tumor). For example:

pat-1_S1_L001_R1_001.fastq.gz

pat-1-D_S1_L001_R1_001.fastq.gz

The part up to -X or _Sxxx (pat-1 in this example) should be a maximum of 30 characters, and must not contain identifiable information as it will be used as a patient reference.
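The convention above can be checked with a regular expression. The following is a minimal sketch, not the script's actual parser: the pattern, the `FastqName` type and the `parse_fastq_name` helper are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for patient_ref<-X>_Sxxx_Lxxx_Rxxx_xxx.fastq.gz.
FASTQ_NAME = re.compile(
    r"(?P<patient_ref>.+?)"       # shortest prefix that lets the rest match
    r"(?:-(?P<label>[DRNT]))?"    # optional -D / -R / -N / -T label
    r"_(?P<sample_id>S\d+)"       # sample ID, e.g. S1
    r"_L\d+"                      # lane, e.g. L001
    r"_(?P<read>R\d+)"            # read, e.g. R1 or R2
    r"_\d+\.fastq\.gz"            # trailing chunk number and extension
)

MAX_PATIENT_REF_LENGTH = 30  # limit stated in the naming convention


class FastqName(NamedTuple):
    patient_ref: str
    label: Optional[str]  # None when the file name carries no -X label
    sample_id: str
    read: str


def parse_fastq_name(filename: str) -> FastqName:
    """Split a FastQ file name into its parts, or raise ValueError."""
    match = FASTQ_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the naming convention")
    if len(match["patient_ref"]) > MAX_PATIENT_REF_LENGTH:
        raise ValueError(f"patient reference in {filename!r} exceeds 30 characters")
    return FastqName(
        match["patient_ref"], match["label"], match["sample_id"], match["read"]
    )


print(parse_fastq_name("pat-1-D_S1_L001_R1_001.fastq.gz"))
```

Note that a patient reference which itself ends in -D, -R, -N or -T is ambiguous under this convention; the sketch treats such a suffix as the label.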

Analyses

Each sample ID will map to one analysis in the ADE and may contain multiple files.

Topologies

There are two topologies: mys and tumorNormal. A topology is created when a patient has a matched pair of D/R files (mys) or N/T files (tumorNormal). For example:

mys	tumorNormal
pat-1-D_S1_L001_R1_001.fastq.gz	pat-1-N_S1_L001_R1_001.fastq.gz
pat-1-D_S1_L001_R2_001.fastq.gz	pat-1-N_S1_L001_R2_001.fastq.gz
pat-1-R_S2_L001_R1_001.fastq.gz	pat-1-T_S2_L001_R1_001.fastq.gz
pat-1-R_S2_L001_R2_001.fastq.gz	pat-1-T_S2_L001_R2_001.fastq.gz

Files which have unmatched D/R or N/T labels will still have an analysis entry, but no topology.
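The pairing rule can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `build_topologies` and `TOPOLOGY_ROLES` are hypothetical names, and the sketch assumes at most one sample per label for each patient reference.

```python
from collections import defaultdict

# Label pairs and the role each label plays in its topology.
TOPOLOGY_ROLES = {
    "mys": {"D": "dna", "R": "rna"},
    "tumorNormal": {"N": "normal", "T": "tumor"},
}


def build_topologies(samples):
    """Build topology entries from (patient_ref, label, sample_id) tuples.

    Samples without a label, or with a label whose partner is missing,
    produce no topology entry.
    """
    labels_by_patient = defaultdict(dict)
    for patient_ref, label, sample_id in samples:
        if label is not None:
            labels_by_patient[patient_ref][label] = sample_id

    topologies = []
    for labels in labels_by_patient.values():
        for topology_type, roles in TOPOLOGY_ROLES.items():
            # Only emit a topology when both labels of the pair are present.
            if all(label in labels for label in roles):
                topologies.append({
                    "definition": {"type": topology_type},
                    "references": [
                        {
                            "analysisReference": {"sampleId": labels[label]},
                            "role": role,
                            "metadata": {},
                        }
                        for label, role in roles.items()
                    ],
                })
    return topologies


# pat-1 has a matched D/R pair; pat-2 has an unmatched N sample.
samples = [("pat-1", "D", "S1"), ("pat-1", "R", "S2"), ("pat-2", "N", "S3")]
print(build_topologies(samples))
```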

Example Usages

Minimum

python3 adegen.py ~/fastq/study001/

All Arguments

python3 adegen.py ~/fastq/study001/ -j ~/sg-upload-v2-latest.jar -o ade.json -r Study001 -p 159 -s 708000 -c

NOTE - The same sampleTypeId and pipeline (and retrieved sequencer) will apply to all entries. Make sure you are already logged in with your credentials using the uploader tool.

ADE Format Description

This documentation gives a technical description of the ADE format, as well as an example of how to build this JSON for a DNA/RNA analysis case. The resulting JSON can be used as input to the uploader tool.

Schema

{
  "protocolName": "ADE",
  "protocolVersion": "1",
  "client": {
    "id": SOPHiA DDM global client ID,
    "userId": User technical ID (internal or external) for the request
  },
  "request": {
    "definition": {
      "userRef": Name of the request (defined by the client),
      "sequencerId": Technical ID of the sequencing tool used for this request,
      "requestDate": Date of the request. Formatted as epoch seconds (UNIX timestamp),
      "isPairedEnd": Whether the sequencing tool creates pairs of files per sample,
      "isPrevent": Optional, whether this request is linked to the SOPHiA PREVENT product. Defaults to false
    },
    "analyses": [
      {
        "definition": {
          "sampleId": Unique sample ID within the request. #multiplexId can be used,
          "multiplexId": Sequencer multiplex ID,
          "sgaPipelineId": Technical pipeline ID,
          "userRef": Name of the sample (defined by the user). The patient ref can be used,
          "sampleTypeId": Sample type definition ID (blood, FFPE...). See the list of sampleTypeId values below,
          "sisNumber": Optional unique SIS purchase order number. Only required for BDS-activated pipelines,
          "libraryType": Optional library type (dna / rna). Defaults to dna,
          "bdsNumber": Mandatory Serial Number for all SOPHiA GENETICS bundle solutions. Can be found on the sticker on the side of box 1 of the bundle solution kit.
        },
        "patient": {
          "personalInformationId": Personal information ID, from the SGP service,
          "medicalInformationId": Medical information ID, from the SGA service
        },
        "restriction": Optional gene panel, in accordance with the consent of the patient. IMPORTANT: Required for somatic analyses,
        "genePanel": { // Optional gene panel tag for the analysis
          "id": Technical gene panel ID,
          "version": Gene panel version,
          "regionsHash": hash of the region names of the gene panel
        },
        "isControlSample": Optional flag for control samples. Defaults to false,
        "files": [ // Files attached to the analysis. Can be empty in the context of remotely referenced samples (see request#topology)
          {
            "definition": {
              "name": Original file name, or absolute path
            },
            "controlKey": Optional original control key, usually a md5 sum of the compressed file
          },
          {
            "definition": {
              "name": Original file name, or absolute path
            },
            "controlKey": Optional original control key, usually a md5 sum of the compressed file
          }
        ]
      }
    ],
    "topology": [
      {
        "definition": {
          "type": Type of the source. e.g.: replicate, tumorNormal, mys
        },
        "references": [
          {
            "analysisReference": {
              "sampleId": Unique sample ID for a given request
            },
            "requestReference": { // Optional reference to a request, allowing to refer to analyses from existing requests,
              "owningClientId": Optional client ID. To be used if the owner of the referred request is different than the current request owner,
              "requestId": Referred request ID
            },
            "role": "dna",
            "metadata": {}
          },
          {
            "analysisReference": {
              "sampleId": Unique sample ID for a given request
            },
            "requestReference": null, // Optional reference to a request, allowing to refer to analyses from existing requests,
            "role": Role of the analysis in the context of this reference,
            "metadata": {} // Context-specific metadata attached to this reference
          }
        ]
      }
    ],
    "files": [] // Additional files attached to the request
  }
}

Building ADE JSON

Pre-requisites

Download the uploader tool from the provided documentation
Login using the uploader tool with your account
Go to https://www.unixtimestamp.com/ to retrieve the "Current Unix timestamp (in seconds)"; this will be used to fill the requestDate field.
Get your clientId and userId using the userInfo command:

userId and clientId

$ python3 sg-upload-v2-wrapper.py userInfo

{
  "userId": 405,
  "loginUsername": "dnoble",
  "clientId": 12
}

Get the pipeline_id and sequencer_id you want to apply to the samples (please see the provided documentation).

pipeline_id and sequencer_id

$ python3 sg-upload-v2-wrapper.py pipeline --list

[
  {
    "pipeline_id": 123,
    "pipeline_name": "Pipeline 123",
    "analysis_type": "BRCA",
    "kit": "Multiplicom_MASTR_assay",
    "sequencer_id": 123456,
    "sequencer": "ILLUMINA_MiSeq",
    "experiment_type": "germline",
    "paired": true
  },
  {
    "pipeline_id": 456,
    "pipeline_name": "Pipeline 456",
    "analysis_type": "HCS_v1_1",
    "kit": "IDT",
    "sequencer_id": 123456,
    "sequencer": "ILLUMINA_MiSeq",
    "experiment_type": "germline",
    "paired": true
  }
]
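Looking up the sequencer for a chosen pipeline can be sketched as below, assuming the `pipeline --list` output has been captured as a JSON string with the field names shown above. The helper name `sequencer_for_pipeline` is hypothetical.

```python
import json


def sequencer_for_pipeline(pipeline_list_json: str, pipeline_id: int) -> int:
    """Return the sequencer_id listed for pipeline_id, or raise KeyError."""
    for entry in json.loads(pipeline_list_json):
        if entry["pipeline_id"] == pipeline_id:
            return entry["sequencer_id"]
    raise KeyError(f"pipeline {pipeline_id} is not in the list")


# Abbreviated copy of the listing above.
listing = """
[
  {"pipeline_id": 123, "pipeline_name": "Pipeline 123", "sequencer_id": 123456},
  {"pipeline_id": 456, "pipeline_name": "Pipeline 456", "sequencer_id": 123456}
]
"""
print(sequencer_for_pipeline(listing, 456))
```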

On the left are files from a DNA/RNA case (note -D and -R), and on the right is a tumorNormal case (note -N and -T). The part up to -D/-R/-N/-T is the patient ref and should be a maximum of 30 characters.


pat1-D_S1_L001_R1_001.fastq.gz	pat1-N_S1_L001_R1_001.fastq.gz
pat1-D_S1_L001_R2_001.fastq.gz	pat1-N_S1_L001_R2_001.fastq.gz
pat1-R_S1_L001_R1_001.fastq.gz	pat1-T_S1_L001_R1_001.fastq.gz
pat1-R_S1_L001_R2_001.fastq.gz	pat1-T_S1_L001_R2_001.fastq.gz

Create this patient using the command line (please see the provided documentation):

$ python3 sg-upload-v2-wrapper.py userInfo patient --create --patient-ref pat1

Remaining patient(s) to create: pat1
Have saved 1 patient(s)

Then retrieve personalInformationId and medicalInformationId.

medicalInformationId and personalInformationId

$ python3 sg-upload-v2-wrapper.py userInfo patient --list --patient-ref pat1

Should display:
[
  {
    "medicalInformationId": 111111111,
    "personalInformationId": 222222222,
    "userRef": "pat1"
  }
]

For each sample, build the JSON chunk for its analysis using the information gathered above (note that pipeline_id = sgaPipelineId). The libraryType field depends on the sample's label (-D/-R/-N/-T).

Example for DNA analysis

{
  "definition": {
    "sampleId": "S1",
    "multiplexId": "S1",
    "sgaPipelineId": 123,
    "userRef": "pat1-D",      // patient ref followed by -D/-R/-N/-T
    "sampleTypeId": 308000,
    "libraryType": "dna"      // -D = dna, -R = rna, -N = dna, -T = dna
  },
  "patient": {
    "personalInformationId": 222222222,
    "medicalInformationId": 111111111
  },
  "isControlSample": false,
  "files": [
    {
      "definition": {
        "name": "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"
      }
    },
    {
      "definition": {
        "name": "/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz"
      }
    }
  ]
}
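The same chunk can be produced programmatically. This is a minimal sketch: `build_analysis` and its parameters are hypothetical names, the label-to-libraryType mapping is taken from the comment in the example above, and reusing the sample ID as multiplexId simply mirrors that example.

```python
# Mapping from file-name label to libraryType, as noted in the example above.
LIBRARY_TYPE = {"D": "dna", "R": "rna", "N": "dna", "T": "dna"}


def build_analysis(sample_id, user_ref, label, pipeline_id, sample_type_id,
                   personal_information_id, medical_information_id, file_paths):
    """Return one entry for the "analyses" array of the ADE."""
    return {
        "definition": {
            "sampleId": sample_id,
            "multiplexId": sample_id,  # assumption: same value as sampleId
            "sgaPipelineId": pipeline_id,
            "userRef": user_ref,
            "sampleTypeId": sample_type_id,
            # The schema states that libraryType defaults to dna.
            "libraryType": LIBRARY_TYPE.get(label, "dna"),
        },
        "patient": {
            "personalInformationId": personal_information_id,
            "medicalInformationId": medical_information_id,
        },
        "isControlSample": False,
        "files": [{"definition": {"name": path}} for path in sorted(file_paths)],
    }


analysis = build_analysis(
    "S1", "pat1-D", "D", 123, 308000, 222222222, 111111111,
    ["/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz",
     "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"],
)
print(analysis["definition"]["libraryType"])
```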

sampleTypeId should be one of:

OTHER = 8000
PERIPHERAL_BLOOD = 108000
FRESH_TUMOR = 208000
FFPE = 308000
BIOPSY = 408000
CELL_LINE = 508000
CTDNA = 608000
BUCCAL_SWAB = 708000
NASOPHARYNGEAL_SWAB = 808000
SPUTUM = 908000
BRONCHOALVEOLAR_LAVAGE = 1008000
SALIVA = 1108000
BONE_MARROW = 1208000

To link DNA with RNA or Normal with Tumor analyses in our system, it is necessary to generate a "topology" JSON chunk like below (comments have to be removed from the built JSON):

DNA/RNA Topology

{
  "definition": {
    "type": "mys"
  },
  "references": [
    {
      "analysisReference": {
        "sampleId": "S1"
      },
      "role": "dna",
      "metadata": {}
    },
    {
      "analysisReference": {
        "sampleId": "S2"
      },
      "role": "rna",
      "metadata": {}
    }
  ]
}

tumorNormal Topology

{
  "definition": {
    "type": "tumorNormal"
  },
  "references": [
    {
      "analysisReference": {
        "sampleId": "S1"
      },
      "role": "normal",
      "metadata": {}
    },
    {
      "analysisReference": {
        "sampleId": "S2"
      },
      "role": "tumor",
      "metadata": {}
    }
  ]
}
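Once the analysis and topology chunks exist, they can be wrapped in the top-level structure from the schema. The sketch below is an assumption-laden illustration: `build_ade` is a hypothetical helper, and it fills requestDate from the local clock with `time.time()` instead of copying a value from a website.

```python
import json
import time


def build_ade(client_id, user_id, user_ref, sequencer_id,
              analyses, topologies, is_paired_end=True):
    """Assemble a complete ADE document as a plain dict."""
    return {
        "protocolName": "ADE",
        "protocolVersion": "1",
        "client": {"id": client_id, "userId": user_id},
        "request": {
            "definition": {
                "userRef": user_ref,
                "sequencerId": sequencer_id,
                "requestDate": int(time.time()),  # epoch seconds
                "isPairedEnd": is_paired_end,
                "isPrevent": False,
            },
            "analyses": list(analyses),
            "topology": list(topologies),
            "files": [],
        },
    }


# json.dumps emits plain JSON, so no comments can end up in the output.
ade = build_ade(12, 405, "Study001", 123456, analyses=[], topologies=[])
print(json.dumps(ade, indent=2))
```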

Full DNA/RNA Example

{
  "protocolName": "ADE",
  "protocolVersion": "1",
  "client": {
    "id": 3,
    "userId": 700
  },
  "request": {
    "definition": {
      "userRef": "Baseline Super Run™",
      "sequencerId": 123456,
      "requestDate": 1568975438,
      "isPairedEnd": true,
      "isPrevent": false
    },
    "state": null,
    "analyses": [
      {
        "definition": {
          "sampleId": "S1",
          "multiplexId": "S1",
          "sgaPipelineId": 123,
          "userRef": "pat1-D",
          "sampleTypeId": 308000,
          "libraryType": "dna"
        },
        "patient": {
          "personalInformationId": 222222222,
          "medicalInformationId": 111111111
        },
        "isControlSample": false,
        "files": [
          {
            "definition": {
              "name": "/path/to/input/files/pat1-D_S1_L001_R1_001.fastq.gz"
            }
          },
          {
            "definition": {
              "name": "/path/to/input/files/pat1-D_S1_L001_R2_001.fastq.gz"
            }
          }
        ]
      },
      {
        "definition": {
          "sampleId": "S2",
          "multiplexId": "S2",
          "sgaPipelineId": 123,
          "userRef": "pat1-R",
          "sampleTypeId": 308000,
          "libraryType": "rna"
        },
        "patient": {
          "personalInformationId": 222222222,
          "medicalInformationId": 111111111
        },
        "isControlSample": false,
        "files": [
          {
            "definition": {
              "name": "/path/to/input/files/pat1-R_S1_L001_R1_001.fastq.gz"
            }
          },
          {
            "definition": {
              "name": "/path/to/input/files/pat1-R_S1_L001_R2_001.fastq.gz"
            }
          }
        ]
      }
    ],
    "topology": [
      {
        "definition": {
          "type": "mys"
        },
        "references": [
          {
            "analysisReference": {
              "sampleId": "S1"
            },
            "role": "dna",
            "metadata": {}
          },
          {
            "analysisReference": {
              "sampleId": "S2"
            },
            "role": "rna",
            "metadata": {}
          }
        ]
      }
    ],
    "files": []
  }
}

Full tumorNormal Example

{
  "protocolName": "ADE",
  "protocolVersion": "1",
  "client": {
    "id": 3,
    "userId": 700
  },
  "request": {
    "definition": {
      "userRef": "Baseline Super Run™",
      "sequencerId": 123456,
      "requestDate": 1568975438,
      "isPairedEnd": true,
      "isPrevent": false
    },
    "analyses": [
      {
        "definition": {
          "sampleId": "S1",
          "multiplexId": "S1",
          "sgaPipelineId": 123,
          "userRef": "pat1-N",
          "sampleTypeId": 308000,
          "libraryType": "dna"
        },
        "patient": {
          "personalInformationId": 222222222,
          "medicalInformationId": 111111111
        },
        "isControlSample": false,
        "files": [
          {
            "definition": {
              "name": "/path/to/input/files/pat1-N_S1_L001_R1_001.fastq.gz"
            }
          },
          {
            "definition": {
              "name": "/path/to/input/files/pat1-N_S1_L001_R2_001.fastq.gz"
            }
          }
        ]
      },
      {
        "definition": {
          "sampleId": "S2",
          "multiplexId": "S2",
          "sgaPipelineId": 123,
          "userRef": "pat1-T",
          "sampleTypeId": 308000,
          "libraryType": "dna"
        },
        "patient": {
          "personalInformationId": 222222222,
          "medicalInformationId": 111111111
        },
        "isControlSample": false,
        "files": [
          {
            "definition": {
              "name": "/path/to/input/files/pat1-T_S1_L001_R1_001.fastq.gz"
            }
          },
          {
            "definition": {
              "name": "/path/to/input/files/pat1-T_S1_L001_R2_001.fastq.gz"
            }
          }
        ]
      }
    ],
    "topology": [
      {
        "definition": {
          "type": "tumorNormal"
        },
        "references": [
          {
            "analysisReference": {
              "sampleId": "S1"
            },
            "role": "normal",
            "metadata": {}
          },
          {
            "analysisReference": {
              "sampleId": "S2"
            },
            "role": "tumor",
            "metadata": {}
          }
        ]
      }
    ],
    "files": []
  }
}

MSK ACCESS example

{
   "protocolName": "ADE",
   "protocolVersion": "1",
   "client": {
      "id": 3,
      "userId": 66238
   },
   "request": {
      "definition": {
         "userRef": "MSKFASTQ2_202402131741",
         "sequencerId": 1606000,
         "requestDate": 1707842498,
         "isPairedEnd": true,
         "isPrevent": false
      },
      "state": null,
      "analyses": [
         {
            "definition": {
               "sampleId": "S01",
               "multiplexId": "S01",
               "sgaPipelineId": 7043,
               "userRef": "ACCESS-1",
               "sampleTypeId": 1408000,
               "libraryType": ""
            },
            "patient": {
               "personalInformationId": 200038697,
               "medicalInformationId": 200058694
            },
            "isControlSample": false,
            "files": [
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-1-N_S01_L001_R2_001.fastq.gz"
                  }
               },
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-1-N_S01_L001_R1_001.fastq.gz"
                  }
               }
            ]
         },
         {
            "definition": {
               "sampleId": "S02",
               "multiplexId": "S02",
               "sgaPipelineId": 7043,
               "userRef": "ACCESS-1",
               "sampleTypeId": 1308000,
               "libraryType": ""
            },
            "patient": {
               "personalInformationId": 200038697,
               "medicalInformationId": 200058694
            },
            "isControlSample": false,
            "files": [
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-1-T_S02_L001_R2_001.fastq.gz"
                  }
               },
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-1-T_S02_L001_R1_001.fastq.gz"
                  }
               }
            ]
         },
         {
            "definition": {
               "sampleId": "S03",
               "multiplexId": "S03",
               "sgaPipelineId": 7043,
               "userRef": "ACCESS-2",
               "sampleTypeId": 1308000,
               "libraryType": ""
            },
            "patient": {
               "personalInformationId": 281900929,
               "medicalInformationId": 338306461
            },
            "isControlSample": false,
            "files": [
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-2-T_S03_L001_R1_001.fastq.gz"
                  }
               },
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-2-T_S03_L001_R2_001.fastq.gz"
                  }
               }
            ]
         },
         {
            "definition": {
               "sampleId": "S04",
               "multiplexId": "S04",
               "sgaPipelineId": 7043,
               "userRef": "ACCESS-3",
               "sampleTypeId": 1508000,
               "libraryType": ""
            },
            "patient": {
               "personalInformationId": 200038701,
               "medicalInformationId": 200058698
            },
            "isControlSample": true,
            "files": [
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-3-CP_S04_L001_R2_001.fastq.gz"
                  }
               },
               {
                  "definition": {
                     "name": "/path/to/input/files/ACCESS-3-CP_S04_L001_R1_001.fastq.gz"
                  }
               }
            ]
         }
      ],
      "topology": [
         {
            "definition": {
               "type": "tumorNormal"
            },
            "references": [
               {
                  "analysisReference": {
                     "sampleId": "S01"
                  },
                  "role": "normal",
                  "metadata": {}
               },
               {
                  "analysisReference": {
                     "sampleId": "S02"
                  },
                  "role": "tumor",
                  "metadata": {}
               }
            ]
         }
      ],
      "files": []
   }
}

Serial numbers

Using `bdsNumber` and `bdsMappingFile` Flags

The script supports two flags for specifying Serial numbers, which are unique identifiers required for all SOPHiA GENETICS bundle solutions. These are --bdsNumber for a single Serial number applicable to all samples, and --bdsMappingFile for a file containing specific Serial numbers mapped to patient references.

`--bdsNumber` Flag

This flag allows you to specify a single Serial number that will be applied to all samples processed by the script. It's useful when all samples in the dataset can be associated with the same Serial number.
The provided Serial number must follow the format BDS-XXXXXXXXXX-XX, where XXXXXXXXXX is a sequence of 10 digits, and XX is a two-digit number that equals the sum of the previous 10 digits.
Example usage: --bdsNumber BDS-0000020695-22
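The format and checksum rule described above can be expressed in a few lines. This is a sketch of that rule only, not the script's own validator; `is_valid_bds_number` is a hypothetical name.

```python
import re

# BDS-XXXXXXXXXX-XX: ten digits, then a two-digit checksum.
BDS_PATTERN = re.compile(r"BDS-([0-9]{10})-([0-9]{2})")


def is_valid_bds_number(serial: str) -> bool:
    """Check the format and that the checksum equals the sum of the ten digits."""
    match = BDS_PATTERN.fullmatch(serial)
    if match is None:
        return False
    digits, checksum = match.groups()
    return sum(int(digit) for digit in digits) == int(checksum)


# 0+0+0+0+0+2+0+6+9+5 = 22, so the example serial number is valid.
print(is_valid_bds_number("BDS-0000020695-22"))  # True
```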

`--bdsMappingFile` Flag

This flag allows you to specify a file containing mappings from patient references to specific Serial numbers. It's useful when different samples or patients in the dataset require different Serial numbers.
The file must be formatted as CSV (comma-separated values), with each line containing a patient reference followed by a comma and the corresponding Serial number. Each Serial number must follow the format BDS-XXXXXXXXXX-XX.
The file format should look like this:

patient_ref1, BDS-0000020695-22
patient_ref2, BDS-0000030695-23
...

Example usage: --bdsMappingFile path/to/bds_mapping_file.csv
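Reading such a mapping file can be sketched with the standard `csv` module. The helper name `load_bds_mapping` is hypothetical, and the sketch assumes exactly two fields per non-empty line; validation of the Serial numbers themselves is left out.

```python
import csv


def load_bds_mapping(lines):
    """Map patient references to Serial numbers from CSV lines.

    `lines` can be an open file object or any iterable of strings.
    """
    mapping = {}
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        if len(row) != 2:
            raise ValueError(f"expected 'patient_ref, serial', got {row!r}")
        patient_ref, serial = (field.strip() for field in row)
        mapping[patient_ref] = serial
    return mapping


print(load_bds_mapping(["patient_ref1, BDS-0000020695-22\n"]))
```

With a real file, pass the open handle: `with open(path, newline="") as handle: mapping = load_bds_mapping(handle)`.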

Validating Serial Numbers

Both the --bdsNumber and --bdsMappingFile flags will trigger validation of the provided Serial numbers to ensure they conform to the required format and checksum. If an invalid Serial number is encountered, the script will terminate with an error message.
