Overview

This script automatically detects new runs to upload, and then downloads the completed analyses. A design file containing certain metadata must be provided; when all files are present, an ADE file is generated and the run is uploaded. When the pipeline has finished, the results are downloaded.

Note: The sg-monitor tool relies on a Java-based CLI (see CLI JAR) to communicate with the Sophia Genetics platform. By default, sg-monitor will automatically check for updates and download new versions when available.

Availability / Limitations

sg-monitor is not available to all customers in all regions due to design limitations. If you want to use sg-monitor, please contact your Sophia Genetics representative or Support Team (support@sophiagenetics.com).

Prerequisites

  • Java 8
  • Python 3.8 or newer
    • See requirements.txt for dependencies
  • Supported platforms: macOS, Linux

Installation

  • Install the latest version of sg-monitor, 2.3.0 (see releases page):
pip install https://api-fr.sophiagenetics.com/uploader/cli/automation/assets/sg-monitor-2.3.0.tar.gz

Or download the package and install it manually:

pip install sg-monitor-2.3.0.tar.gz

Authentication

First, log in with your Sophia Genetics DDM credentials.

Run the following commands to authenticate:

  • Download Sophia Genetics CLI JAR wrapper script:
wget https://api-fr.sophiagenetics.com/uploader/cli/scripts/sg-upload-v2-wrapper.py
  • Login with your Sophia Genetics credentials:
python sg-upload-v2-wrapper.py login -u <username> -p <password>

The wrapper automatically downloads the latest version of the CLI JAR, saves it to the current directory, and asks you for the password.

Note: Currently, the Sophia Genetics CLI JAR supports only grid-card two-factor authentication.

See Sophia Genetics CLI JAR Authentication section for more details.

Use

sg-monitor -h

Output:

usage: sg-monitor [-h] [-v] [-u] [-y YAML] [--skip-download] [folder]

Monitor specified folder

positional arguments:
  folder         Path to the folder containing run repos

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  Print version
  -u, --update   Update sg-monitor to the latest version
  -y YAML        YAML file to override application settings
  --skip-download Skip downloading results

Example usage:

sg-monitor /Users/alice/seq_output/

To monitor a folder without downloading results when analyses complete:

sg-monitor --skip-download /Users/alice/seq_output/

Each subfolder will be checked, and if ready, the upload process begins. When the analysis is complete, the results are downloaded to that folder (unless --skip-download is specified).

Example directory structure in which the results for repo001 have been downloaded, and repo002 has not been started.

|--- seq_output
    |--- repo001
        |--- designFile.xlsx
        |--- pr2543_S1_R1_001.fastq.gz
        |--- pr2543_S1_R1_002.fastq.gz
        |--- pr5675_S2_R1_001.fastq.gz
        |--- pr5675_S2_R1_002.fastq.gz
        |--- sg_downloads
            |--- full_variant_table.txt
                        .
                        .
                        .
            |--- CNV-Report.pdf
    |--- repo002
        |--- designFile.xlsx
        |--- pr676_S1_R1_001.fastq.gz
        |--- pr676_S1_R1_002.fastq.gz
        |--- pr223_S2_R1_001.fastq.gz
        |--- pr223_S2_R1_002.fastq.gz

Design File

The design file lists the fastq files along with the pipeline/sequencer, sample type, tissue type, gene panel, and patient ref. See this example.

The current implementation uses an Excel spreadsheet, but other formats can be used by extending the DesignFileParser class. There is no naming convention for the design file - as long as there is a single .xls or .xlsx in the folder, the monitor will find it. The order of the columns is not important.
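The lookup rule described above (a single .xls or .xlsx file in the folder) can be sketched as follows. This is an illustration only, not the monitor's actual code: the function name find_design_file is hypothetical, and both the case-insensitive extension match and the choice to return None when zero or several spreadsheets are present are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_design_file(repo: Path) -> Optional[Path]:
    """Return the only .xls/.xlsx file in repo, or None if there is not exactly one."""
    candidates = sorted(
        p for p in repo.iterdir()
        # Assumption: extensions are matched case-insensitively.
        if p.is_file() and p.suffix.lower() in {".xls", ".xlsx"}
    )
    # Assumption: zero or several spreadsheets are treated as "no design file".
    return candidates[0] if len(candidates) == 1 else None
```

Because the file name itself is never inspected, designFile.xlsx and run_42.xls would be treated the same way.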

Column        Type
Patient Ref   string (30)
Filename      string (256)
Pipeline      int
Library Type  string
Sample Type   int/string
Gene Panel    string
Parent ID     long
Sequencer     int
LIMS ID       string
  • Patient Ref: Identifier for the patient (must not contain personally identifiable information) (required)
  • Filename: fastq.gz file without full path (required)
  • Pipeline: Numeric code for pipeline (required)
  • Library Type: One of DNA/RNA/Tumor/Normal (default DNA)
  • Sample Type: One of FFPE, BONE_MARROW, etc. or numeric code (required)
  • Gene Panel: List of gene names for restriction (default none)
  • Parent ID: ID of the root panel for the Gene Panel (if not given, the monitor will try to figure it out)
  • Sequencer: Numeric code for sequencer (required)
  • LIMS ID: Client-assigned ID for this run

The same patient can have multiple files. Each file requires a pipeline, and all files for the same patient must use the same pipeline. Sequencer and LIMS ID only need to be given once.
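As a rough illustration of these rules, the sketch below checks rows that have already been read from the spreadsheet into dictionaries keyed by column name. It is not the DesignFileParser implementation: the function name validate_rows, the row representation, and the wording of the messages are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Columns marked "required" on every row in the list above.
REQUIRED_PER_ROW = ("Patient Ref", "Filename", "Pipeline", "Sample Type")


def validate_rows(rows: List[Dict[str, object]]) -> List[str]:
    """Return a list of problems; an empty list means the rows look consistent."""
    problems = []
    pipelines = defaultdict(set)  # patient ref -> pipeline codes seen
    for i, row in enumerate(rows, start=1):
        for column in REQUIRED_PER_ROW:
            if row.get(column) in (None, ""):
                problems.append(f"row {i}: missing {column}")
        if row.get("Patient Ref") and row.get("Pipeline") not in (None, ""):
            pipelines[row["Patient Ref"]].add(row["Pipeline"])
    # All files of one patient must share a single pipeline.
    for patient, used in pipelines.items():
        if len(used) > 1:
            codes = sorted(str(code) for code in used)
            problems.append(f"{patient}: more than one pipeline {codes}")
    # Sequencer is required, but only once per design file.
    if not any(row.get("Sequencer") not in (None, "") for row in rows):
        problems.append("Sequencer must be given at least once")
    return problems
```

Returning a list of messages rather than raising on the first problem lets all issues in a design file be reported in one pass.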

How it works

A new thread is started for each folder to determine its status and take appropriate action. A file called sg_lock is written to the repo folder for the lifetime of the thread to prevent another thread from trying to process the same folder if the monitor makes another sweep before an action is completed. Below is a summary of statuses and their actions.

Status                   Meaning                            Action
Ready                    Expected files present             Create & Upload
Waiting for upload       Run created but not uploaded       Upload
Finished                 Analysis completed                 Download
Locked                   Being processed by another thread  Skip
Not Ready                Expected files not yet present     Skip
Downloaded               Analysis downloaded                Skip
Upload in progress*      Files uploading                    Skip
Pipeline running         Analysis running                   Skip
Error                    Error in analysis                  Skip (Email sent by backend)
Status code unknown (x)  Unknown error                      Skip (Email sent by backend)

*This won't actually occur, as a thread will have a lock on the repo for the duration of the upload.
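The sg_lock mechanism can be pictured with the minimal sketch below. It only illustrates the idea of a lock file that lives as long as the processing thread; the helper name repo_lock and the use of an exclusive-create open are assumptions, not a description of how sg-monitor implements it.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def repo_lock(repo: Path):
    """Yield True if this caller created sg_lock, False if the repo is already locked."""
    lock = repo / "sg_lock"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so two
        # sweeps cannot both believe they own the same folder.
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # corresponds to the "Locked" status: skip this folder
        return
    os.close(fd)
    try:
        yield True
    finally:
        # Remove the lock when processing ends, even after an exception.
        lock.unlink()
```

A caller would wrap the per-folder work in "with repo_lock(repo) as acquired:" and skip the folder when acquired is False.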

A Day in the Life

Suppose a cron job runs the monitor every hour. Let's walk through what happens when it runs.

08h - There are no folders, so nothing happens.

09h - Not Ready: There is a single folder, seq_run_123/, containing a design file and 3 fastq files. However, the design file lists 4 files, so the thread stops.

10h - Ready: seq_run_123/ now contains all 4 files. The thread successfully creates a new run, writing the ID to seq_run_123/sg_id while the CLI writes the meta file ~/.sg-upload-client/upload_bar_X.json (X is the run ID), but thanks to a hiccough in the client's network, the upload fails.

11h - Waiting for upload: Finding seq_run_123/sg_id and the meta file, the thread attempts to upload the run; this time, everything goes smoothly and the CLI deletes the meta file.

12h - Pipeline running: Finding seq_run_123/sg_id but no meta file, the thread polls the backend for the status. As the pipeline is running, the thread exits.

13h - Finished: Again, having seq_run_123/sg_id but no meta file, the monitor requests the status from the backend. Now the analysis is finished, so the results are downloaded to seq_run_123/sg_downloads/ and seq_run_123/sg_id is deleted.

14h - Downloaded: As seq_run_123/sg_downloads/ is already present, the thread finishes.
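The walkthrough above implies that much of a folder's status can be read from the files on disk before the backend is contacted. The sketch below captures that reading under several assumptions: the function name local_status is hypothetical, sg_id is assumed to hold the run ID as plain text, the order of the checks is a guess, and "Ask backend" is a placeholder for the cases (Pipeline running, Finished, Error, ...) that only the backend can distinguish.

```python
from pathlib import Path
from typing import Iterable


def local_status(repo: Path, expected_files: Iterable[str], meta_dir: Path) -> str:
    """Classify a repo folder from local files only, following the walkthrough."""
    if (repo / "sg_downloads").is_dir():
        return "Downloaded"
    sg_id = repo / "sg_id"
    if sg_id.is_file():
        # Assumption: sg_id contains just the run ID.
        run_id = sg_id.read_text().strip()
        meta = meta_dir / f"upload_bar_{run_id}.json"
        # A leftover meta file means the run was created but the upload did not finish.
        return "Waiting for upload" if meta.is_file() else "Ask backend"
    # No run yet: ready only when every file listed in the design file is present.
    if all((repo / name).is_file() for name in expected_files):
        return "Ready"
    return "Not Ready"
```

In the scenario above, meta_dir would be ~/.sg-upload-client and expected_files the Filename column of the design file.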

FlowChart

Update

When a new version of sg-monitor is released, the log will contain a warning like this:

WARNING:New version of sg-monitor available: xxxx .Please update by running sg-monitor -u

You must run the update explicitly with the -u option:

sg-monitor -u

Releases

see Releases

FAQ

see FAQ