
Last pushed: 2 years ago
Wrapper image for a customized Gifford lab bcbio-nextgen pipeline.


Wrapper scripts for running bcbio-nextgen pipelines.


Example files: we want to align /cluster/test/human.fastq and
/cluster/test/mouse.fastq to the human and mouse genomes, respectively.

First change directory to /cluster/test and create the following CSV
file, named align.csv:


For consistency, label your description in the format:


For example, a mouse DNase experiment by Rich might look like:


Then run either:

/cluster/shortreads/utils/bcbio-wrapper/ align.csv

to run on our cluster (via SGE), or:

/cluster/shortreads/utils/bcbio-wrapper/ align.csv /cluster/ec2/cred

for EC2.

Main function

Run for usage instructions, which are duplicated here:

usage: [-h] [-t TEMPLATE] [-q QUEUE] [-i IMAGE] [-d DATADIR]
                [-n SLOTS] [-l] [-e] [-j] [-nd]

Automated bcbio-nextgen SGE submission script. The CSV input file must conform
to the bcbio-nextgen templating engine requirements, with a header row of
column names. Required columns are 'samplename' and 'genome_build'. The
'samplename' must be the prefix of the fastq (or bam) files (optionally
gzipped). The 'genome_build' must match a build name found in /cluster/bcbio-
nextgen/biodata/genomes/. The bam files will be named following the
'description' field, in separate directories. The aligner and its options are
selectable by the template YAML.

positional arguments:
  csv                   CSV file containing the experiment metadata.

optional arguments:
  -h, --help            show this help message and exit
  -t TEMPLATE, --template TEMPLATE
                        Template YAML file (see bcbio-nextgen docs).
  -q QUEUE, --queue QUEUE
                        SGE queue for job submission (default is batch).
  -i IMAGE, --image IMAGE
                        Docker container with bcbio installed (default is
  -d DATADIR, --datadir DATADIR
                        Directory with genome and annotation information
                        (default is /cluster/bcbio-nextgen/biodata).
  -n SLOTS, --slots SLOTS
                        Number of cores to request and use (default is 4).
  -l, --local           Run locally (not on SGE) (default is SGE).
  -e, --echo            Don't run, just print the command lines.
  -j, --joint           Analyze files jointly (useful for variant calling)
                        (default is no).
  -nd, --nodocker       Run alignment directly (outside of docker).

The basic idea is that you have a CSV file describing an experiment or
set of experiments and a YAML file describing a general approach to
alignment and processing. Several alignment templates developed for
different tasks have been put in the templates directory.

The CSV and YAML files use the same format as the bcbio-nextgen
templating engine, so you can check out the bcbio-nextgen documentation
for more ideas and options. Specifically, most parameters listed there
can be added as columns to your CSV file, and the pipeline will
process them and do the right thing.

You can comment out experiments you don't want to analyze in the CSV
with a leading #.
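
As a rough illustration of the CSV rules described so far (a header row, the required 'samplename' and 'genome_build' columns, and rows disabled with a leading #), here is a minimal Python sketch. It is not the wrapper's actual code: the function name read_metadata and the sample values are made up for this example.

```python
import csv
import io

# Column names taken from the usage text above.
REQUIRED_COLUMNS = {"samplename", "genome_build"}


def read_metadata(handle):
    """Return metadata rows as dicts, skipping rows commented out with '#'."""
    # Drop blank lines and lines whose first non-space character is '#'.
    lines = [ln for ln in handle if ln.strip() and not ln.lstrip().startswith("#")]
    reader = csv.DictReader(lines)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing required columns: " + ", ".join(sorted(missing)))
    return list(reader)


# Hypothetical metadata, for illustration only.
EXAMPLE = """samplename,description,genome_build
human,example_human,hg19
#mouse,example_mouse,mm10
"""

rows = read_metadata(io.StringIO(EXAMPLE))
print([row["samplename"] for row in rows])
```

Only the uncommented human row is returned here. Checking the header before any job is submitted is one way to catch a mislabeled column early.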

General usage

A metadata CSV has no fixed column order; the only requirement is that
columns are labeled in the first row. A few columns are required by
our pipeline.

An example:


The samplename column is a hint for the script to find the raw read
files (fastq, fastq.gz, bam, etc.). It will look for files that have
the row's samplename as a prefix (including both _1 and _2 for
paired-end data, for example). Be careful not to use sample names that
are prefixes of each other, since one name would then also match the
other sample's files.
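
The prefix matching, and the reason overlapping names cause trouble, can be sketched in a few lines of Python. This is an assumption about how such a lookup might work, not the script's real implementation. The helper names and file names are hypothetical.

```python
def files_for_sample(samplename, filenames):
    """Return the file names that start with the sample name."""
    return sorted(f for f in filenames if f.startswith(samplename))


def prefix_collisions(samplenames):
    """Return (shorter, longer) pairs where one sample name is a prefix of another."""
    names = sorted(set(samplenames))
    return [(a, b) for a in names for b in names if a != b and b.startswith(a)]


# Hypothetical file names, for illustration only.
files = ["liver_1.fastq.gz", "liver_2.fastq.gz", "liver2_1.fastq.gz"]

# 'liver' also picks up the file that belongs to 'liver2'.
print(files_for_sample("liver", files))
print(prefix_collisions(["liver", "liver2", "heart"]))
```

Running a check like prefix_collisions over the samplename column before launching is a cheap way to spot ambiguous names.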

The description column is the label for the final bams in the
experiment directory. The batch column chooses which sets of
samples are variant-called (or mRNA quantified) together. The
quality_format string (illumina or standard) is required for
some pipelines, but bcbio-nextgen will crash if the value you
specify does not match your data. The genome_build column chooses
the genome reference. Right now, hg19 and mm10 have the best support
for all the templates, but this will expand over time.

The script can be run in a joint mode (--joint) that analyzes all
samples together. This is typically only useful for runs with
downstream processing, like variant calling or mRNA quantification,
where multiple samples need to be processed together. The default
mode is to treat each run (line in the file) separately, creating an
output directory based on the description, lab, genome, and template.
This is the best choice for plain genome alignment since it offers the
most parallelization.
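
To make the difference between the two modes concrete, here is a small Python sketch. It assumes that the default mode produces one run per CSV row and that --joint produces a single run holding every row, with the batch column partitioning samples inside that run. The names plan_runs and samples_by_batch are hypothetical and the real script may organize this differently.

```python
from collections import defaultdict


def plan_runs(rows, joint=False):
    """Return a list of runs, where each run is a list of metadata rows."""
    if joint:
        # Joint mode: all samples are analyzed together in one run.
        return [list(rows)]
    # Default mode: every row is its own run, so runs can go in parallel.
    return [[row] for row in rows]


def samples_by_batch(rows):
    """Group sample descriptions by their 'batch' value (empty if unset)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get("batch") or ""].append(row["description"])
    return dict(groups)


# Hypothetical rows, for illustration only.
rows = [
    {"samplename": "s1", "description": "tumor", "batch": "b1"},
    {"samplename": "s2", "description": "normal", "batch": "b1"},
    {"samplename": "s3", "description": "control", "batch": "b2"},
]

print(len(plan_runs(rows)), len(plan_runs(rows, joint=True)))
print(samples_by_batch(rows))
```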


  • Basic alignment (bam output only):

    • bwa_align_template.yaml (alignment only)
    • bowtie2_chipseq_template.yaml (adapter trimming and bowtie2 alignment, see bcbio docs for how to specify adapter sequences)
  • RNA-seq (--joint will make multi-sample count tables, split by batch if you specify it in a column; otherwise single-sample tables can be merged afterwards):

    • tophat2_rnaseq_template.yaml
    • STAR_rnaseq_template.yaml
    • rnatest_rnaseq_template.yaml (possibly deprecated in newer versions of bcbio: bwa for alignment, followed by exon counting)
  • Variant calling (requires --joint and makes multi-sample VCF files with snpEff tagging):

    • bwa_variants_template.yaml
    • multi_variants_template.yaml (calls and annotates variants with Freebayes, Platypus, and Samtools)
    • multi_joint_variants_template.yaml (runs the same joint analysis but with a custom multi-sample merging step from bcbio.variation; requires --joint analysis)

Useful parameters to consider adding to your CSV (not all have been tested here):

  • disambiguate (removes reads that map to a second genome, useful for filtering)
  • strandedness ([unstranded, firststrand, secondstrand], useful for RNA-seq)
  • jointcaller
  • coverage_interval ([exome, genome, regional])
  • variant_regions (BED file of regions to call variants in)
  • svcaller
  • quality_format ([standard, illumina])
  • assemble_transcripts ([True, False], turns on de novo transcript discovery via Cufflinks with the Tophat RNA-seq template; slow)

Running pipeline

You provide a CSV and (optionally) a YAML pipeline template,
and the script can execute the analysis in several ways. The
default is to submit jobs to SGE. You can specify a single queue (or
machine) with --queue. You can specify a local analysis (useful for
debugging or quick feedback) with --local. To see what commands
would be run without running them, use --echo, optionally along with
--nodocker to simplify the output.

Relative pathnames in the CSV are found based on the location of the
CSV file. Output directories (and SGE output logs, if applicable) are
created in the working directory when the program is executed (not
necessarily the same directory as the CSV).
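
The path rules above can be sketched in Python with pathlib. This is a simplified illustration under the stated rules, not the script's own code. The helper names resolve_input and output_dir are hypothetical.

```python
from pathlib import Path


def resolve_input(csv_path, entry):
    """Resolve a path from the CSV relative to the directory that holds the CSV."""
    entry = Path(entry)
    if entry.is_absolute():
        return entry
    return Path(csv_path).resolve().parent / entry


def output_dir(name):
    """Place an output directory under the current working directory."""
    return Path.cwd() / name


# A relative entry is looked up next to the CSV, not next to the caller.
print(resolve_input("examples/gtex_test.csv", "reads/sample"))
# Output goes wherever the program was started from.
print(output_dir("example_output"))
```

Keeping the two base directories separate means the same CSV can be launched from any scratch directory without editing its paths.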

Example alignment run (from within the examples directory)

Generate the alignment commands with:

../ gtex_test.csv --local --nodocker --echo | grep align_bcbio > commands.txt

commands.txt now contains the alignment run information (without the docker prefix).

Launch on SGE:

../ gtex_test.csv (alignment only)

../ gtex_test.csv --joint --template templates/STAR_rnaseq_template.yaml (joint mRNA quantification with STAR)

Launch on EC2: (email optional)

../ gtex_test.csv /cluster/ec2/cred

Log files will appear in /cluster/ec2.

Internal setup

The Docker image used to run analyses (locally and on SGE) can be
created with the Dockerfile in this repository or with the container
in our fork of bcbio-nextgen. Push it to the Docker hub so that all
nodes can retrieve it (with the -i flag in

The data directory on permanent storage in /cluster/ is populated by
the script. This directory can be changed
with the -d option for
