This script automatically detects new runs to upload, and then downloads the completed analyses. A design file containing certain metadata must be provided; when all files are present, an ADE file is generated and the run is uploaded. When the pipeline has finished, the results are downloaded.
python monitor.py /Users/alice/seq_output/
Each subfolder will be checked, and if ready, the upload process begins. When the analysis is complete, the results are downloaded to that folder.
Example directory structure in which the results for repo001 have been downloaded, and repo002 has not been started.
    |--- seq_output
        |--- repo001
            |--- designFile.xlsx
            |--- pr2543_S1_R1_001.fastq.gz
            |--- pr2543_S1_R1_002.fastq.gz
            |--- pr5675_S2_R1_001.fastq.gz
            |--- pr5675_S2_R1_002.fastq.gz
            |--- sg_downloads
                |--- full_variant_table.txt
                .
                .
                .
                |--- CNV-Report.pdf
        |--- repo002
            |--- designFile.xlsx
            |--- pr676_S1_R1_001.fastq.gz
            |--- pr676_S1_R1_002.fastq.gz
            |--- pr223_S2_R1_001.fastq.gz
            |--- pr223_S2_R1_002.fastq.gz
The design file lists the fastq files for a run along with the pipeline, sequencer, sample type, tissue type, gene panel, and patient ref for each. See this example.
The current implementation uses an Excel spreadsheet, but other formats can be used by extending the DesignFileParser class. There is no naming convention for the design file - as long as there is a single .xls or .xlsx in the folder, the monitor will find it. The order of the columns is not important.
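The lookup rule described above can be sketched as follows. This is a minimal illustration rather than the monitor's actual code: the function name `find_design_file` is hypothetical, and returning `None` when a folder holds zero or several spreadsheets is an assumption.

```python
from pathlib import Path

# File extensions the text above names as valid design files.
DESIGN_SUFFIXES = {".xls", ".xlsx"}


def find_design_file(folder):
    """Return the single .xls/.xlsx file in `folder`, or None.

    Hypothetical helper: treating "no spreadsheet" and "more than one
    spreadsheet" as "no design file" is an assumption.
    """
    candidates = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in DESIGN_SUFFIXES
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Because only the extension is checked, the file can carry any name, which matches the absence of a naming convention.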
| Patient Ref | Filename | Pipeline | Library Type | Sample Type | Gene Panel | Parent ID | Sequencer | LIMS ID |
|---|---|---|---|---|---|---|---|---|
| string (30) | string (256) | int | string | int/string | string | long | int | string |
The same patient can have multiple files. Each file requires a pipeline, and all files belonging to the same patient must use the same pipeline. Sequencer and LIMS ID are only required once.
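The pipeline rule can be checked with a few lines of Python. The sketch below assumes the design file has already been parsed into a list of dicts keyed by column header; the function name `check_pipelines` and the wording of the messages are hypothetical.

```python
from collections import defaultdict


def check_pipelines(rows):
    """Return a list of problems found in parsed design-file rows.

    Assumes each row is a dict keyed by column header, for example
    {"Patient Ref": "pr2543", "Filename": "a.fastq.gz", "Pipeline": 1}.
    """
    problems = []
    pipelines_by_patient = defaultdict(set)
    for row in rows:
        pipeline = row.get("Pipeline")
        if pipeline in (None, ""):
            # Every file needs a pipeline.
            problems.append(f"{row.get('Filename')}: missing pipeline")
            continue
        pipelines_by_patient[row.get("Patient Ref")].add(pipeline)
    for patient, used in pipelines_by_patient.items():
        if len(used) > 1:
            # All files of one patient must share a single pipeline.
            listed = sorted(used, key=str)
            problems.append(f"{patient}: conflicting pipelines {listed}")
    return problems
```

An empty list means the rows satisfy both conditions.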
A new thread is started for each folder to determine its status and take the appropriate action. A file called sg_lock is written to the repo folder for the lifetime of the thread, which prevents another thread from processing the same folder if the monitor makes another sweep before an action has completed. Below is a summary of the statuses and their actions.
| Status | Meaning | Action |
|---|---|---|
| Ready | Expected files present | Create & Upload |
| Waiting for upload | Run created but not uploaded | Upload |
| Finished | Analysis completed | Download |
| Locked | Being processed by another thread | Skip |
| Not Ready | Expected files not yet present | Skip |
| Downloaded | Analysis downloaded | Skip |
| Upload in progress* | Files uploading | Skip |
| Pipeline running | Analysis running | Skip |
| Error | Error in analysis | Skip (Email sent by backend) |
| Status code unknown (x) | Unknown error | Skip (Email sent by backend) |
*This won’t actually occur, as a thread will have a lock on the repo for the duration of the upload.
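The sg_lock mechanism could be implemented along the following lines. This is a sketch under stated assumptions: the name `repo_lock` is hypothetical, and the use of exclusive file creation (`os.O_EXCL`) is one common way to make the check-and-create step atomic, not necessarily what the monitor does.

```python
import os
from contextlib import contextmanager

LOCK_NAME = "sg_lock"  # lock file name as given in the text


@contextmanager
def repo_lock(folder):
    """Yield True if the lock was acquired, False if the folder is locked."""
    lock_path = os.path.join(folder, LOCK_NAME)
    try:
        # O_EXCL makes creation fail if the file already exists, so
        # only one caller can create the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        acquired = False
    else:
        os.close(fd)
        acquired = True
    try:
        yield acquired
    finally:
        # Only the caller that created the lock removes it.
        if acquired:
            os.remove(lock_path)
```

A worker thread would wrap its work in `with repo_lock(folder) as acquired:` and return immediately when `acquired` is false, which corresponds to the Locked row in the table.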
Suppose a cron job runs the monitor every hour. Let’s walk through what happens when it runs.
08h - There are no folders, so nothing happens.
09h - Not Ready: There is a single folder, seq_run_123/, containing a design file and 3 fastq files. However, the design file lists 4 files, so the thread stops.
10h - Ready: seq_run_123/ now contains all 4 files. The thread successfully creates a new run, writing the ID to seq_run_123/sg_id, while the CLI writes the meta file ~/.sg-upload-client/upload_bar_X.json (where X is the run ID). However, thanks to a hiccough in the client’s network, the upload fails.
11h - Waiting for upload: Finding seq_run_123/sg_id and the meta file, the thread attempts to upload the run. This time everything goes smoothly, and the CLI deletes the meta file.
12h - Pipeline running: Finding seq_run_123/sg_id but no meta file, the thread polls the backend for the status. As the pipeline is still running, the thread exits.
13h - Finished: Again finding seq_run_123/sg_id but no meta file, the thread requests the status from the backend. The analysis is now finished, so the results are downloaded to seq_run_123/sg_downloads/ and seq_run_123/sg_id is deleted.
14h - Downloaded: As seq_run_123/sg_downloads/ is already present, the thread finishes.
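The walkthrough implies that most statuses can be derived from files on disk, with the backend consulted only when a run ID exists and the upload has completed. The sketch below illustrates that decision logic. The function name `local_status`, the `POLL_BACKEND` marker, the order of the checks, and the assumption that sg_id contains only the run ID are all hypothetical.

```python
from pathlib import Path

POLL_BACKEND = "Poll backend"  # hypothetical marker: status must come from the backend


def local_status(folder, expected_files, meta_dir):
    """Derive a folder's status from the files on disk.

    `expected_files` are the file names listed in the design file;
    `meta_dir` is the CLI's meta folder (~/.sg-upload-client in the text).
    """
    folder = Path(folder)
    if (folder / "sg_lock").exists():
        return "Locked"
    if (folder / "sg_downloads").is_dir():
        return "Downloaded"
    id_file = folder / "sg_id"
    if id_file.exists():
        # Assumption: sg_id holds nothing but the run ID.
        run_id = id_file.read_text().strip()
        meta_file = Path(meta_dir) / f"upload_bar_{run_id}.json"
        if meta_file.exists():
            return "Waiting for upload"
        # Uploaded already: running, finished, or failed on the backend.
        return POLL_BACKEND
    if all((folder / name).exists() for name in expected_files):
        return "Ready"
    return "Not Ready"
```

When `POLL_BACKEND` is returned, the caller would query the backend to distinguish Pipeline running, Finished, and Error.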