Metadata-Version: 2.1
Name: sg-monitor
Version: 1.1.3
Summary: Python scripts to monitor sequencer data
Home-page: https://sophiagenetics.com
Author: Stephan Curran, Vasyl Vaskul
Author-email: scurran@sophiagenetics.com
License: Copyright (C) Sophia Genetics S.A.
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown

# Overview

This script automatically detects new runs to upload, and then downloads the completed analyses. A _design file_
containing certain metadata must be provided; when all files are present, an ADE file is generated and the
run is uploaded. When the pipeline has finished, the results are downloaded.

# Prerequisites

- Java 8
- Python 3.7 or newer
    - See requirements.txt for dependencies
- (Optional) [Bazel](https://bazel.build/), needed to build the tink dependency from source when no binary is published for your environment (for example, macOS 12)

# Package the script 

```
python3 setup.py sdist
```

# Installation

* Download the latest version of the script from [here](../scripts/sg-monitor-latest.zip).

* Unzip the archive:

```
unzip sg-monitor-latest.zip
```

* Go to the root directory of the unzipped archive:

```
cd sg-monitor
```

* Create and activate a virtual environment:

```
python3 -m venv venv
source venv/bin/activate
```

* Install dependencies:

```
pip install -r requirements.txt
```


Notes:

If you have issues installing the _tink_ dependency on macOS, please download a pre-built distribution for your platform from [tink-bazel-release](https://pypi.org/project/tink/#files).


# Setup

First, you need to download the Sophia Genetics CLI and log in with your Sophia Genetics DDM credentials.

See the Sophia Genetics CLI JAR [Authentication](../docs/#login) section for more details.

Please run the following command to authenticate:

```
python sg-upload-v2-wrapper.py login -u <username> -p <password>
```

It will automatically download the latest version of the CLI JAR, save it to the current directory, and prompt you for the password.

_Note: currently the Sophia Genetics CLI JAR supports only two-factor authentication with a grid card._


## Use

Create a [cron job](https://en.wikipedia.org/wiki/Cron) (or [Scheduled Task](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/schtasks) on Windows) to execute sg_monitor/monitor.py regularly with the
folder to monitor as the only argument.
<pre>
python monitor.py /Users/alice/seq_output/
</pre>
Each subfolder will be checked, and if ready, the upload process begins. When the analysis is complete, the results are
downloaded to that folder.

Example directory structure in which the results for repo001 have been downloaded and repo002 has not been started:
<pre>
|--- seq_output
    |--- repo001
        |--- designFile.xlsx
        |--- pr2543_S1_R1_001.fastq.gz
        |--- pr2543_S1_R1_002.fastq.gz
        |--- pr5675_S2_R1_001.fastq.gz
        |--- pr5675_S2_R1_002.fastq.gz
        |--- sg_downloads
            |--- full_variant_table.txt
                        .
                        .
                        .
            |--- CNV-Report.pdf
    |--- repo002
        |--- designFile.xlsx
        |--- pr676_S1_R1_001.fastq.gz
        |--- pr676_S1_R1_002.fastq.gz
        |--- pr223_S2_R1_001.fastq.gz
        |--- pr223_S2_R1_002.fastq.gz
</pre>

## Design File
The design file lists the fastq files along with the pipeline/sequencer, sample type, tissue type, gene panel, and
patient ref. See this [example](docs/designFileExample.xlsx).

The current implementation uses an Excel spreadsheet, but other formats can be supported by extending the DesignFileParser
class. There is no naming convention for the design file: as long as there is a single *.xls* or *.xlsx* file in the folder,
the monitor will find it. The order of the columns is not important.
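
The lookup rule above (exactly one *.xls* or *.xlsx* file per folder, any name) can be illustrated with a minimal sketch. The function name `find_design_file` is hypothetical and is not part of the monitor's code; how the real monitor behaves when it finds zero or several spreadsheets is not documented here, so returning `None` in those cases is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_design_file(run_folder: Path) -> Optional[Path]:
    """Return the only .xls/.xlsx file in run_folder.

    Hypothetical helper: returns None when there is no spreadsheet or when
    there is more than one (assumed behaviour, not taken from the monitor).
    """
    candidates = sorted(
        path
        for path in run_folder.iterdir()
        if path.is_file() and path.suffix.lower() in {".xls", ".xlsx"}
    )
    return candidates[0] if len(candidates) == 1 else None
```

For example, `find_design_file(Path("/Users/alice/seq_output/repo001"))` would return the path to *designFile.xlsx* in the directory structure shown earlier.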

| Patient Ref | Filename | Pipeline | Library Type | Sample Type | Gene Panel | Parent ID | Sequencer | LIMS ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| string (30) | string (256) | int | string | int/string | string | long | int | string |

* **Patient Ref** Identifier for the patient (must not contain personally identifiable information) (required)
* **Filename** fastq.gz file *without* full path (required)
* **Pipeline** Numeric code for pipeline (required)
* **Library Type** One of DNA/RNA/Tumor/Normal (default DNA)
* **Sample Type** One of FFPE, BONE_MARROW, etc. or numeric code (required)
* **Gene Panel** List of gene names for restriction (default none)
* **Parent ID** ID of the root panel for the Gene Panel (if not given, the monitor will try to figure it out)
* **Sequencer** Numeric code for sequencer (required)
* **LIMS ID** Client-assigned ID for this run

The same patient can have multiple files. Each file requires a pipeline, and all files for the same patient must use the same pipeline. Sequencer and LIMS ID only need to be given once.
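
To make the column rules above concrete, here is a rough sketch of a consistency check over rows that have already been read into dictionaries keyed by column name. It is not the monitor's own validation: `validate_design_rows` and the constants are hypothetical names, and treating Sample Type as required on every row is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption: these columns must be filled in on every row.
PER_ROW_REQUIRED = ("Patient Ref", "Filename", "Pipeline", "Sample Type")
# Length limits taken from the column table: string (30) and string (256).
MAX_LENGTHS = {"Patient Ref": 30, "Filename": 256}


def _cell(row: Dict[str, object], column: str) -> str:
    """Return the cell as a stripped string; missing or None becomes ''."""
    value = row.get(column)
    return "" if value is None else str(value).strip()


def validate_design_rows(rows: List[Dict[str, object]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []
    pipelines_by_patient = defaultdict(set)
    for number, row in enumerate(rows, start=1):
        for column in PER_ROW_REQUIRED:
            if not _cell(row, column):
                errors.append(f"row {number}: missing {column}")
        for column, limit in MAX_LENGTHS.items():
            if len(_cell(row, column)) > limit:
                errors.append(f"row {number}: {column} longer than {limit} characters")
        patient, pipeline = _cell(row, "Patient Ref"), _cell(row, "Pipeline")
        if patient and pipeline:
            pipelines_by_patient[patient].add(pipeline)
    # All files of one patient must share a single pipeline.
    for patient, pipelines in sorted(pipelines_by_patient.items()):
        if len(pipelines) > 1:
            errors.append(f"patient {patient}: conflicting pipelines {sorted(pipelines)}")
    # Sequencer is required, but only once per design file.
    if not any(_cell(row, "Sequencer") for row in rows):
        errors.append("Sequencer must be given on at least one row")
    return errors
```

Because rows are looked up by column name rather than position, the check is independent of column order, which matches the design file's behaviour.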

# How it works
A new thread is started for each folder to determine its status and take appropriate action. A file called *sg_lock* is
written to the repo folder for the lifetime of the thread to prevent another thread from trying to process the same
folder if the monitor makes another sweep before an action is completed. Below is a summary of statuses and their actions.

| Status | Meaning | Action |
| --- | --- | --- |
| Ready | Expected files present | Create & Upload |
| Waiting for upload | Run created but not uploaded | Upload |
| Finished | Analysis completed | Download |
| Locked | Being processed by another thread | Skip |
| Not Ready | Expected files not yet present | Skip |
| Downloaded | Analysis downloaded | Skip |
| Upload in progress* | Files uploading | Skip |
| Pipeline running | Analysis running | Skip |
| Error | Error in analysis | Skip (Email sent by backend) |
| Status code unknown (x) | Unknown error | Skip (Email sent by backend) |

*This won't actually occur, as a thread will have a lock on the repo for the duration of the upload.
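
The *sg_lock* file described above can be sketched as a context manager. This is only an illustration of the idea, not the monitor's actual code: `folder_lock` is a hypothetical name, and using an exclusive create (`O_CREAT | O_EXCL`) so that two threads cannot both take the lock is an assumption about how such a lock could be made safe.

```python
import os
from contextlib import contextmanager
from pathlib import Path

LOCK_NAME = "sg_lock"


@contextmanager
def folder_lock(run_folder: Path):
    """Yield True if this caller took the lock, False if the folder is Locked."""
    lock_path = run_folder / LOCK_NAME
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # another thread owns the folder: skip it
        return
    os.close(fd)
    try:
        yield True
    finally:
        # Remove the lock when the work is done, even if it raised.
        lock_path.unlink()
```

A worker thread would wrap its work in `with folder_lock(folder) as acquired:` and return immediately when `acquired` is `False`, which corresponds to the **Locked** row in the table.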

## A Day in the Life
Suppose a cron job runs the monitor every hour. Let's walk through what happens when it runs.

08h - There are no folders, so nothing happens.

09h - **Not Ready** There is a single folder, *seq_run_123/*, containing a design file and 3 fastq files. However, the
design file lists 4 files, so the thread stops.

10h - **Ready** *seq_run_123/* now contains all 4 files. The thread successfully creates a new run, writing the ID to 
*seq_run_123/sg_id* while the CLI writes the meta file *~/.sg-upload-client/upload_bar_X.json* (*X* is the run ID), but
thanks to a hiccough in the client's network, the upload fails.

11h - **Waiting for upload** Finding *seq_run_123/sg_id* and the meta file, the thread attempts to upload the run;
this time, everything goes smoothly and the CLI deletes the meta file.

12h - **Pipeline running** Finding *seq_run_123/sg_id* but no meta file, the thread polls the backend for the status. As the pipeline is still running, the thread exits.

13h - **Finished** Again, having *seq_run_123/sg_id* but no meta file, the monitor requests the status from the backend.
Now, the analysis is finished, so the results are downloaded to *seq_run_123/sg_downloads/* and *seq_run_123/sg_id* is deleted.


14h - **Downloaded** As *seq_run_123/sg_downloads/* is already present, the thread finishes.
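
The walkthrough above amounts to a small decision procedure based on which marker files exist. The sketch below mirrors it under several assumptions: `determine_status` and `poll_backend` are hypothetical names, the order of the checks is inferred from the walkthrough rather than taken from the monitor's code, and *sg_id* is assumed to contain the run ID as plain text.

```python
from pathlib import Path
from typing import Callable, Iterable


def determine_status(
    run_folder: Path,
    expected_files: Iterable[str],
    meta_dir: Path,
    poll_backend: Callable[[str], str],
) -> str:
    """Return a status name from the table, based on files found on disk.

    expected_files: filenames listed in the design file.
    meta_dir: folder holding the CLI meta files (~/.sg-upload-client).
    poll_backend: hypothetical callback mapping a run ID to a backend
        status such as "Pipeline running", "Finished" or "Error".
    """
    if (run_folder / "sg_lock").exists():
        return "Locked"
    if (run_folder / "sg_downloads").is_dir():
        return "Downloaded"
    id_file = run_folder / "sg_id"
    if id_file.exists():
        run_id = id_file.read_text().strip()
        # The meta file is present until the upload has succeeded.
        if (meta_dir / f"upload_bar_{run_id}.json").exists():
            return "Waiting for upload"
        return poll_backend(run_id)
    if all((run_folder / name).exists() for name in expected_files):
        return "Ready"
    return "Not Ready"
```

Passing the backend query in as a callback keeps the sketch free of network code, so the file-based part of the logic can be exercised on its own.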


![Monitor Flowchart](docs/flowchart.png "Monitor Flowchart")
