Overview

This script automatically detects new sequencing runs, uploads them, and then downloads the completed analyses. Each run folder must contain a design file describing its fastq files (see Design File); when all listed files are present, an ADE file is generated and the run is uploaded. When the pipeline has finished, the results are downloaded.

Prerequisites

Use

Create a cron job (or Scheduled Task on Windows) to execute sg_monitor/monitor.py regularly with the folder to monitor as the only argument.
python monitor.py /Users/alice/seq_output/

Each subfolder will be checked, and if ready, the upload process begins. When the analysis is complete, the results are downloaded to that folder.

Example directory structure in which the results for repo001 have been downloaded and repo002 has not been started:
|--- seq_output
    |--- repo001
        |--- designFile.xlsx
        |--- pr2543_S1_R1_001.fastq.gz
        |--- pr2543_S1_R1_002.fastq.gz
        |--- pr5675_S2_R1_001.fastq.gz
        |--- pr5675_S2_R1_002.fastq.gz
        |--- sg_downloads
            |--- full_variant_table.txt
                        .
                        .
                        .
            |--- CNV-Report.pdf
    |--- repo002
        |--- designFile.xlsx
        |--- pr676_S1_R1_001.fastq.gz
        |--- pr676_S1_R1_002.fastq.gz
        |--- pr223_S2_R1_001.fastq.gz
        |--- pr223_S2_R1_002.fastq.gz

Design File

The design file lists the fastq files along with the pipeline, sequencer, sample type, library type, gene panel, and patient ref for each. See this example.

The current implementation uses an Excel spreadsheet, but other formats can be used by extending the DesignFileParser class. There is no naming convention for the design file - as long as there is a single .xls or .xlsx in the folder, the monitor will find it. The order of the columns is not important.
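The lookup described above can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the function name `find_design_file` is hypothetical, and returning `None` when there are zero or several spreadsheets is an assumption, since the text only covers the case of exactly one.

```python
from pathlib import Path

# Extensions treated as design files, per the description above.
DESIGN_SUFFIXES = {".xls", ".xlsx"}


def find_design_file(folder):
    """Return the single spreadsheet in `folder`, or None if there is not exactly one.

    Hypothetical helper: the None fallback is an assumption of this sketch.
    """
    candidates = [
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in DESIGN_SUFFIXES
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Because the match is on the extension only, the file can be called designFile.xlsx, samples.xls, or anything else.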

Patient Ref  Filename      Pipeline  Library Type  Sample Type  Gene Panel  Parent ID  Sequencer  LIMS ID
string (30)  string (256)  int       string        int/string   string      long       int        string

The same patient can have multiple files. Each file requires a pipeline, and all files belonging to the same patient must use the same pipeline. Sequencer and LIMS ID are only required once.
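The one-pipeline-per-patient rule can be checked before anything is uploaded. The sketch below assumes the design file has already been parsed into a list of dictionaries keyed by the column names above; that representation and the name `pipeline_conflicts` are assumptions for illustration, not part of the monitor.

```python
from collections import defaultdict


def pipeline_conflicts(rows):
    """Return {patient ref: set of pipelines} for patients with more than one pipeline.

    `rows` is assumed to be a list of dicts with "Patient Ref" and "Pipeline" keys.
    """
    pipelines = defaultdict(set)
    for row in rows:
        pipelines[row["Patient Ref"]].add(row["Pipeline"])
    # Keep only the patients that break the rule.
    return {ref: found for ref, found in pipelines.items() if len(found) > 1}
```

An empty result means every patient's files agree on the pipeline.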

How it works

A new thread is started for each folder to determine its status and take appropriate action. A file called sg_lock is written to the repo folder for the lifetime of the thread to prevent another thread trying to process the same folder if the monitor makes another sweep before an action is completed. Below is a summary of statuses and their actions.
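A minimal sketch of the thread-per-folder sweep with an sg_lock file is shown below. It is not the monitor's own implementation: the names `repo_lock`, `process_folder`, and `sweep` are hypothetical, creating the file atomically with `O_EXCL` is one possible way to take the lock, and waiting for all threads with `join` is an assumption.

```python
import os
import threading
from contextlib import contextmanager

LOCK_NAME = "sg_lock"


@contextmanager
def repo_lock(folder):
    """Yield True if this caller created the lock file, False if it already existed."""
    path = os.path.join(folder, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file exists, so only one caller wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        acquired = False
    else:
        os.close(fd)
        acquired = True
    try:
        yield acquired
    finally:
        # Only the caller that created the lock removes it.
        if acquired:
            os.remove(path)


def process_folder(folder, action):
    """Run `action(folder)` unless another thread holds the lock."""
    with repo_lock(folder) as acquired:
        if acquired:
            action(folder)


def sweep(root, action):
    """Start one thread per subfolder of `root` and wait for all of them."""
    threads = []
    for entry in os.scandir(root):
        if entry.is_dir():
            thread = threading.Thread(target=process_folder, args=(entry.path, action))
            thread.start()
            threads.append(thread)
    for thread in threads:
        thread.join()
```

In this sketch a lock file left behind by a killed process would keep its folder skipped until the file is removed by hand.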

Status                   Meaning                            Action
Ready                    Expected files present             Create & Upload
Waiting for upload       Run created but not uploaded       Upload
Finished                 Analysis completed                 Download
Locked                   Being processed by another thread  Skip
Not Ready                Expected files not yet present     Skip
Downloaded               Analysis downloaded                Skip
Upload in progress*      Files uploading                    Skip
Pipeline running         Analysis running                   Skip
Error                    Error in analysis                  Skip (Email sent by backend)
Status code unknown (x)  Unknown error                      Skip (Email sent by backend)

*This won’t actually occur, as a thread will have a lock on the repo for the duration of the upload.
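Only three statuses in the table lead to work; everything else is skipped. One way to express that is a small lookup with a default, sketched here with hypothetical names (`Action`, `ACTIONS`, `action_for`) that are not taken from the monitor's source.

```python
from enum import Enum


class Action(Enum):
    CREATE_AND_UPLOAD = "create and upload"
    UPLOAD = "upload"
    DOWNLOAD = "download"
    SKIP = "skip"


# Statuses that trigger work, as listed in the table above.
ACTIONS = {
    "Ready": Action.CREATE_AND_UPLOAD,
    "Waiting for upload": Action.UPLOAD,
    "Finished": Action.DOWNLOAD,
}


def action_for(status):
    """Return the action for a status; any status not listed above is skipped."""
    return ACTIONS.get(status, Action.SKIP)
```

Defaulting to skip means an unrecognised status never starts an upload or download.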

A Day in the Life

Suppose a cron job runs the monitor every hour. Let’s walk through what happens when it runs.

08h - There are no folders, so nothing happens.

09h - Not Ready: There is a single folder, seq_run_123/, containing a design file and 3 fastq files. However, the design file lists 4 files, so the thread stops.

10h - Ready: seq_run_123/ now contains all 4 files. The thread successfully creates a new run and writes its ID to seq_run_123/sg_id, while the CLI writes the meta file ~/.sg-upload-client/upload_bar_X.json (X is the run ID). However, thanks to a hiccough in the client’s network, the upload fails.

11h - Waiting for upload: Finding seq_run_123/sg_id and the meta file, the thread attempts to upload the run; this time, everything goes smoothly and the CLI deletes the meta file.

12h - Pipeline running: Finding seq_run_123/sg_id but no meta file, the thread polls the backend for the status. As the pipeline is still running, the thread exits.

13h - Finished: Again, having seq_run_123/sg_id but no meta file, the monitor requests the status from the backend. Now the analysis is finished, so the results are downloaded to seq_run_123/sg_downloads/ and seq_run_123/sg_id is deleted.

14h - Downloaded: As seq_run_123/sg_downloads/ is already present, the thread finishes.
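The walkthrough above implies that much of a folder's status can be read from the files on disk before the backend is contacted. The sketch below is one possible reading of that logic, not the monitor's code: the function name `local_status`, the order of the checks, the `POLL_BACKEND` sentinel, and the assumption that sg_id holds the run ID as plain text are all illustrative. `meta_dir` stands for the CLI's meta folder (~/.sg-upload-client in the walkthrough) and `expected_files` for the filenames listed in the design file.

```python
import os

# Sentinel meaning "the local files cannot decide; ask the backend".
POLL_BACKEND = "Poll backend"


def local_status(folder, expected_files, meta_dir):
    """Derive a folder's status from local files alone (hypothetical helper)."""
    if os.path.exists(os.path.join(folder, "sg_lock")):
        return "Locked"
    if os.path.isdir(os.path.join(folder, "sg_downloads")):
        return "Downloaded"

    id_path = os.path.join(folder, "sg_id")
    if os.path.isfile(id_path):
        # Assumption: sg_id contains just the run ID as text.
        with open(id_path) as handle:
            run_id = handle.read().strip()
        meta_path = os.path.join(meta_dir, f"upload_bar_{run_id}.json")
        if os.path.isfile(meta_path):
            return "Waiting for upload"
        # Run created and uploaded: only the backend knows how far it has got.
        return POLL_BACKEND

    missing = [
        name for name in expected_files
        if not os.path.isfile(os.path.join(folder, name))
    ]
    return "Not Ready" if missing else "Ready"
```

When the function returns `POLL_BACKEND`, the backend's answer would supply the Pipeline running, Finished, or Error status.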

Monitor Flowchart