
Last pushed: 2 years ago
Short Description
Bioinfo pipeline for cancer samples used in clinseq
Full Description


Library naming

Libraries should be named using the format PROJECT-SDID-TYPE-SAMPLEID-PREPID-CAPTUREID, where

  • PROJECT is a two-letter project designator. One of AL (alascca), LB (liquid biopsy) and OT (other).
  • SDID is an identifier for a single individual.
  • TYPE is the sample type, one of T (tumor), N (normal) and P (ctDNA)
  • SAMPLEID identifies a single biological sample, for example a piece of a tumor or a single tube of plasma. The combination SDID-TYPE-SAMPLEID must uniquely identify a single sample.
  • PREPID describes which prep kit was used. It must be a two-letter short name followed by a single digit. If a second prep is carried out on the same sample, the digit is incremented by one. Examples: TD1, TD2.
  • CAPTUREID describes the capture that was performed on a library.
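
The naming rules above can be checked mechanically. The following is a minimal sketch, not Autoseq's own parser: the names LibraryName and parse_library_name are hypothetical, and it assumes that no individual field contains a hyphen. It only validates the rules stated in the list (project designator, sample type, and the shape of PREPID).

```python
from typing import NamedTuple

# Values taken from the naming rules above.
PROJECTS = {"AL", "LB", "OT"}
SAMPLE_TYPES = {"T", "N", "P"}


class LibraryName(NamedTuple):
    """Hypothetical container for the six fields of a library name."""
    project: str
    sdid: str
    sample_type: str
    sample_id: str
    prep_id: str
    capture_id: str


def parse_library_name(name: str) -> LibraryName:
    """Split a library name into its fields and check the documented rules.

    Assumption: fields never contain a hyphen themselves.
    """
    fields = name.split("-")
    if len(fields) != 6:
        raise ValueError(f"expected 6 hyphen-separated fields, got {len(fields)}: {name!r}")
    lib = LibraryName(*fields)
    if lib.project not in PROJECTS:
        raise ValueError(f"unknown project designator: {lib.project!r}")
    if lib.sample_type not in SAMPLE_TYPES:
        raise ValueError(f"unknown sample type: {lib.sample_type!r}")
    prep = lib.prep_id
    # PREPID: two letters followed by a single digit, e.g. TD1.
    if not (len(prep) == 3 and prep[:2].isalpha() and prep[2].isdigit()):
        raise ValueError(f"malformed prep ID: {prep!r}")
    return lib


# Hypothetical library name built from the rules above.
print(parse_library_name("LB-NA12877-P-03098850-TD1-TT1"))
```

The sketch does not validate CAPTUREID, because the list above does not state its exact shape.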

Allowed Prep IDs

Autoseq knows about the following preparation methods:


Allowed Capture IDs

Autoseq knows about the following capture kits:

  • CS = clinseq_v3_targets
  • CZ = clinseq_v4
  • EX = EXOMEV3
  • EO = EXOMEV1
  • RF = fusion_v1
  • CC = core_design
  • CD = discovery_coho
  • CB = big_design
  • TT = test-regions


Autoseq can use any of the runners implemented in pypedream: shellrunner (default), localqrunner or slurmrunner.

General options

--libdir is the directory where the libraries live. Each library should have its own subdirectory where fastq.gz files can be placed. Autoseq recognizes files named in the format _1.fastq.gz/_2.fastq.gz.
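
To illustrate the expected directory layout, here is a minimal sketch that pairs read 1 and read 2 files in one library's subdirectory. It is not Autoseq's own file discovery code. The function name find_fastq_pairs is hypothetical, and it assumes that both files of a pair share the same prefix before _1.fastq.gz and _2.fastq.gz.

```python
from pathlib import Path

R1_SUFFIX = "_1.fastq.gz"
R2_SUFFIX = "_2.fastq.gz"


def find_fastq_pairs(libdir, library):
    """Return (read1, read2) path pairs found in <libdir>/<library>/.

    Assumption: paired files differ only in the _1/_2 suffix.
    Read 1 files without a matching read 2 file are skipped.
    """
    lib_path = Path(libdir) / library
    pairs = []
    for r1 in sorted(lib_path.glob("*" + R1_SUFFIX)):
        prefix = r1.name[: -len(R1_SUFFIX)]
        r2 = r1.with_name(prefix + R2_SUFFIX)
        if r2.exists():
            pairs.append((r1, r2))
    return pairs
```

For example, find_fastq_pairs("/path/to/libdir", "NA12877-T-03098849-TD1-TT1") would list the pairs placed in that library's subdirectory.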

LiqBio pipeline

The Liquid Biopsy pipeline is invoked by

autoseq --ref ref.json --outdir /path/to/outdir --jobdb jobdb.json --cores 5 --runner_name slurmrunner --libdir /path/to/libdir liqbio sample.json

The sample.json file has the format

    {
        "sdid": "NA12877",
        "panel": {
            "T": "NA12877-T-03098849-TD1-TT1",
            "N": "NA12877-N-03098121-TD1-TT1",
            "P": ["NA12877-P-03098850-TD1-TT1", "NA12877-P-03098850-TD2-TT1"]
        },
        "wgs": {
            "T": "NA12877-T-03098849-TD1-WGS",
            "N": "NA12877-N-03098121-TD1-WGS",
            "P": ["NA12877-P-03098850-TD1-WGS"]
        }
    }

This file allows a single tumor sample and a single normal sample, but multiple plasma samples. If no tumor or normal sample is available, the corresponding value can be set to null. If no plasma samples are available, P should be set to an empty list: "P": [].
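
The rules for T, N and P can be checked before starting a run. The following is a minimal sketch, not part of Autoseq: the function name validate_sample is hypothetical, and it assumes that only the sections actually present in the file ("panel", "wgs") need to be checked.

```python
import json


def validate_sample(sample: dict) -> None:
    """Raise ValueError if the sample description breaks the rules above."""
    if not isinstance(sample.get("sdid"), str):
        raise ValueError("'sdid' must be a string")
    for section in ("panel", "wgs"):
        if section not in sample:
            continue  # assumption: a missing section is acceptable
        entry = sample[section]
        for key in ("T", "N"):
            # A single library name, or null when no sample is available.
            if key not in entry or not (entry[key] is None or isinstance(entry[key], str)):
                raise ValueError(f"{section}.{key} must be a library name or null")
        plasma = entry.get("P")
        # Always a list, possibly empty; never null.
        if not isinstance(plasma, list) or not all(isinstance(p, str) for p in plasma):
            raise ValueError(f"{section}.P must be a list of library names (use [] if none)")


# A sample with no tumor and no plasma libraries in the panel section.
example = json.loads("""
{
    "sdid": "NA12877",
    "panel": {"T": null, "N": "NA12877-N-03098121-TD1-TT1", "P": []}
}
""")
validate_sample(example)
```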

For the plasma samples, merging of libraries will take place before calling. On alignment, the @RG tag will be set as follows:


Of note is that the library tag (LB) does not include the CAPTUREID part, to ensure that PCR duplicates are removed correctly.

If a single prepared sample is captured twice, creating the libraries NA12877-T-49-TD1-TT1 and NA12877-T-49-TD1-TT2 (note the different digits in the capture ID), read pairs that are identical between the two libraries should be considered duplicates, since the sample was split after the final PCR step. Therefore, the LB for both libraries is set to NA12877-T-49-TD1. After the BAM files are merged, PCR duplicates are removed using Picard MarkDuplicates, which treats reads that share the same LB as coming from a single library.
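
The relationship between a library name and its LB tag can be sketched as follows. This is an illustration of the rule described above rather than Autoseq's own code. The function name library_tag is hypothetical, and it assumes that CAPTUREID is always the last hyphen-separated field.

```python
def library_tag(library_name: str) -> str:
    """Return the LB value for a library: its name without the CAPTUREID part."""
    prefix, sep, _capture_id = library_name.rpartition("-")
    if not sep or not prefix:
        raise ValueError(f"no CAPTUREID field found in {library_name!r}")
    return prefix


# Both captures of the same prep share one LB value, so identical read
# pairs from the two libraries can be marked as duplicates after merging.
assert library_tag("NA12877-T-49-TD1-TT1") == "NA12877-T-49-TD1"
assert library_tag("NA12877-T-49-TD1-TT2") == "NA12877-T-49-TD1"
```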

Automated testing on travis-ci

For automated testing, a test reference genome and a test data set with relevant data are supplied.

Reference genome

The test reference genome and assets are available for download as an archive that contains a sliced version of a full set of genome files for autoseq, including various key genes.

The whole chromosomes 3, 10, 17, X and Y are selected, after which everything except the following regions has been masked (to speed up alignment):

3    178863388    179014224    PIK3CA_150k
10    83068546    96283182    PTEN_13M
17    7558477    7589399    TP53_30k
X    66782057    66796840    14k_AR_exon
Y    6810425    6825985    15k_on_Y

From these regions, key exons and various other regions have been selected to mimic a small exome.
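
As a quick sanity check, the sizes of the unmasked regions listed above can be computed from their coordinates and compared with the size hints in their names. This sketch assumes BED-style half-open intervals, so that length = end - start. The source does not state the coordinate convention; with closed 1-based coordinates each length would be one larger.

```python
# Region table copied from the list of unmasked regions above.
REGIONS = """\
3 178863388 179014224 PIK3CA_150k
10 83068546 96283182 PTEN_13M
17 7558477 7589399 TP53_30k
X 66782057 66796840 14k_AR_exon
Y 6810425 6825985 15k_on_Y
"""


def region_lengths(table: str) -> dict:
    """Map region name to length, assuming half-open intervals."""
    lengths = {}
    for line in table.splitlines():
        if not line.strip():
            continue
        _chrom, start, end, name = line.split()
        lengths[name] = int(end) - int(start)
    return lengths


for name, length in region_lengths(REGIONS).items():
    print(f"{name}: {length} bp")
```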

The Test Dataset

A synthetic tumor/normal/plasma dataset has been created for testing purposes. Read pairs from the selected targets have been extracted from the Illumina Platinum 200x WGS sample NA12877. These reads have then been randomly assigned to create a virtual normal sample with ≈50x coverage, and the remaining reads (≈150x coverage) have been put aside. To create a virtual tumor and a virtual plasma sample, variants have been spiked into the 150x data at the following positions:

  • TP53 insertion: MU2185182, chr17:g.7578475->G
  • TP53 deletion: MU25947, chr17:g.7577558G>-
  • TP53 DNV: MU52971976, chr17:g.7574003GG>AA
  • PIK3CA hotspot E545K, MU5219, chr3:g.178936091G>A
  • PTEN hotspot R130Q, MU29098, chr10:g.89692905G>A
  • PTEN hotspot R233*, MU589331, chr10:g.89717672C>T
  • AR intron variant, MU50988553, chrX:g.66788924G>A

In the virtual tumor, the target variant allele fraction (VAF) is 30% and in the virtual plasma sample the target VAF is 20%.
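
A test could compare the VAF observed at a spiked-in position with these targets. The following is a minimal sketch under stated assumptions: the function names and the tolerance value are hypothetical and do not come from the Autoseq test suite, and the read counts are illustrative only.

```python
# Target VAFs stated above.
TARGET_VAF = {"tumor": 0.30, "plasma": 0.20}


def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele fraction: reads supporting the variant / all reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def near_target(observed: float, target: float, tolerance: float = 0.10) -> bool:
    """True if the observed VAF is within an (assumed) absolute tolerance."""
    return abs(observed - target) <= tolerance


# Illustrative counts: 41 variant reads out of 150 at one position.
observed = vaf(41, 150)
print(near_target(observed, TARGET_VAF["tumor"]))
```

An absolute tolerance is used here because sampling noise makes the observed fraction at any single position deviate from the target.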

The variants have been selected from ICGC simple somatic mutations v20 with the aim of covering common small variants, including SNVs, deletions, insertions and DNVs. Note that the tests do not address the global sensitivity and PPV of the pipeline, but are only intended to ensure that variants of all kinds are detected by the pipeline.
