

This repository is an attempt to standardize most of our working
procedures for modern biological experiments. As a by-product, we
want to achieve truly reproducible research, the lack of which is one
of the biggest problems in modern science.

As we experienced the horror of hundreds of
*insert-your-favorite-interpreter-language-here* scripts floating
around without any documentation, we felt the need for a more
sophisticated approach. That's why we are porting all our workflows to

Note: Every workflow is a work in progress and should be handled as such. This repository does not make command-line programs easier to use; you need to understand how both the command-line programs and snakemake work.


Docker image

The easiest way is to use our Docker image, which comes with all software preinstalled and configured. (Currently only the develop branch has a working Dockerfile; this will be updated in the next release.)

docker pull kubiac/grosse-ngs-suite:develop

Then create a directory with a Snakefile and a config.json in the described structure (cf. How it should be used) and start the container with

docker run -v /path/to/dir:/data/in kubiac/grosse-ngs-suite:develop
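
As a minimal sketch, the command above can be assembled from Python for a given project directory. The function name `docker_run_command` is hypothetical; the image tag and the `/data/in` mount point are taken from the commands above. The path is resolved first because Docker interprets a non-absolute `-v` source as a named volume rather than a host directory.

```python
from pathlib import Path

IMAGE = "kubiac/grosse-ngs-suite:develop"  # image tag from the pull command above


def docker_run_command(project_dir):
    """Build the argument list for running the image on one project directory."""
    # Docker treats a non-absolute -v source as a named volume, not a host
    # path, so resolve the directory to an absolute path first.
    host_dir = Path(project_dir).resolve()
    return ["docker", "run", "-v", f"{host_dir}:/data/in", IMAGE]


if __name__ == "__main__":
    # Only print the command here; pass the list to subprocess.run(...)
    # to actually execute it.
    print(" ".join(docker_run_command(".")))
```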


For a manual installation, install a current version of each of the following programs and make them available via the PATH variable. The installation scripts that we use ourselves are available here.

  • python 3
  • snakemake
  • bwa
  • segemehl
  • bedtools (a current version and version 2.22.1, for reasons)
  • samtools
  • GATK
  • picard
  • R (with rpy2 support)
  • featureCounts
  • FastQC
  • trimmomatic

Then clone this repository somewhere and point your Snakefiles (examples in Workflows) to the rules that you want to use.
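
Before running any rules after a manual installation, it can help to confirm that the listed tools are reachable via the PATH. The following is a rough sketch, not part of the repository. The executable names in `REQUIRED_TOOLS` are assumptions (GATK and picard, for example, are often started through a jar file or wrapper script), so adjust them to your setup.

```python
import shutil

# Executable names are assumptions; adjust them to your installation
# (GATK and picard, for example, are often started via a jar or wrapper).
REQUIRED_TOOLS = [
    "python3", "snakemake", "bwa", "segemehl.x", "bedtools", "samtools",
    "gatk", "picard", "R", "featureCounts", "fastqc", "trimmomatic",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests. Note that this check says nothing about versions, so the requirement for bedtools 2.22.1 still has to be verified by hand.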

List of analyses

  • GATK SNP-Calling
  • CNV analysis
  • coverage analysis
  • ... other things that you can do by combining all the rules available

How it should be used

Every set of snakemake rules works only with a fixed naming schema and config file. Both are documented here.

File Naming and Directory Structure

Our naming schema uses directories quite heavily to separate different parameters/programs/approaches. This is a subset of the complete naming schema/directory structure, which should suffice to get the gist:

├── config.json
├── Snakefile
├── data
│   └── reads
│       ├── filtered
│       └── raw
│           ├── reads_R1.fastq.gz
│           └── reads_R2.fastq.gz
├── plots
│   └── coverage
│       └── bwa
│           └── hg19
│               └── raw
│                   └── BRCA1-ENST00000468300.pdf
└── results
    ├── coverage
    │   └── bwa
    │       └── hg19
    │           └── raw
    │               └── sample1_illumina_trusightcancer_20_20.cov
    ├── de
    │   └── featureCounts
    │       └── hg19
    │           └── raw
    │               └── all.counts
    ├── mapping
    │   ├── bwa
    │   │   └── hg19
    │   │       └── raw
    │   │           └── sample1.bam
    │   └── segemehl        
    ├── qa
    │   └── fastqc
    │       └── raw
    │       └── raw
    │           └── reads_R1_fastqc.html
    └── variants
        └── gatk
            └── hg19
                └── raw
                    └── patient1.vcf
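
Most result paths in the tree above appear to follow the pattern `<root>/<analysis>/<tool>/<reference>/<read set>/<file>`. The sketch below composes paths under that assumption. The pattern is inferred from the example tree, not from the rules themselves; the function and parameter names are hypothetical, and some branches (such as `qa/fastqc`) have no reference level.

```python
from pathlib import PurePosixPath


def result_path(analysis, tool, reference, read_set, filename, root="results"):
    """Compose <root>/<analysis>/<tool>/<reference>/<read_set>/<filename>."""
    return str(PurePosixPath(root, analysis, tool, reference, read_set, filename))


if __name__ == "__main__":
    # Paths taken from the example tree above.
    print(result_path("mapping", "bwa", "hg19", "raw", "sample1.bam"))
    print(result_path("variants", "gatk", "hg19", "raw", "patient1.vcf"))
    print(result_path("coverage", "bwa", "hg19", "raw",
                      "BRCA1-ENST00000468300.pdf", root="plots"))
```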

Snakefile and config file

The most important part of the config is the definition of the reads-samples-patient relationship. This is done via two hashes:

"samples": {
    "patient1": [ "sample1", "sample2" ],
    "patient2": [ "sampleA", "sampleB" ]
},
"units": {
    "sample1": ["reads_R1.fastq.gz", "reads_R2.fastq.gz" ],
    "sample2": ["more_reads_R1.fastq.gz", "more_reads_R2.fastq.gz" ],
    "sampleA": ["S3_R1.fastq.gz", "S3_R2.fastq.gz" ],
    "sampleB": ["S4_R1.fastq.gz", "S4_R2.fastq.gz" ]
}

Sample config files and Snakefiles are available in the repository as workflows.
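
To illustrate how the two hashes relate, here is a small sketch that resolves a patient to the read files of each of its samples. It assumes the config contains only the `samples` and `units` hashes shown above, wrapped in braces so that the text parses as JSON. The helper name `reads_for_patient` is hypothetical and not taken from the repository's rules.

```python
import json

# The two hashes from the config example, wrapped in braces so the text
# parses as a complete JSON document.
CONFIG_TEXT = """
{
  "samples": {
    "patient1": ["sample1", "sample2"],
    "patient2": ["sampleA", "sampleB"]
  },
  "units": {
    "sample1": ["reads_R1.fastq.gz", "reads_R2.fastq.gz"],
    "sample2": ["more_reads_R1.fastq.gz", "more_reads_R2.fastq.gz"],
    "sampleA": ["S3_R1.fastq.gz", "S3_R2.fastq.gz"],
    "sampleB": ["S4_R1.fastq.gz", "S4_R2.fastq.gz"]
  }
}
"""


def reads_for_patient(config, patient):
    """Map each sample of a patient to its list of read files."""
    return {sample: config["units"][sample]
            for sample in config["samples"][patient]}


config = json.loads(CONFIG_TEXT)

if __name__ == "__main__":
    print(reads_for_patient(config, "patient1"))
```

A sample that is listed under a patient but missing from `units` raises a `KeyError` here, which makes an inconsistent config fail early.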

Release-Numbering conventions

 x.y.z
 │ │ └─ Patch level: the results are the same, but visualization might have
 │ │        changed and bugs preventing output might have been fixed
 │ └─── Minor version level: entirely new workflows may be available; the
 │          results might have changed due to bug fixes or changes of
 │          parameters, software, etc.
 └───── Major version level: restructuring of the naming schema; results of
            existing workflows might have changed
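
The conventions above can be turned into a quick check of whether results from two releases are comparable. This sketch assumes release numbers are written as three dot-separated integers (major.minor.patch), which the description suggests but does not state. The function names and the version strings in the usage example are hypothetical.

```python
def change_level(old, new):
    """Return 'major', 'minor', 'patch', or 'none' for two 'X.Y.Z' strings."""
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    # The first differing component, from left to right, decides the level.
    for name, a, b in zip(("major", "minor", "patch"), old_parts, new_parts):
        if a != b:
            return name
    return "none"


def results_may_differ(old, new):
    """Per the conventions above, only patch-level changes keep results equal."""
    return change_level(old, new) in ("major", "minor")


if __name__ == "__main__":
    # Hypothetical version numbers, for illustration only.
    print(change_level("1.2.3", "1.2.4"), results_may_differ("1.2.3", "1.2.4"))
    print(change_level("1.2.3", "1.3.0"), results_may_differ("1.2.3", "1.3.0"))
```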