Public | Automated Build

Last pushed: 2 years ago
Short Description
Automated builds of proovread containers
Full Description

#+LATEX: \pagebreak

What's new

  • revisited unitig mode with improved filter for repeat regions
  • dazzling proovread - experimental support for [[][DALIGNER]] as unitig mapper (faster/more sensitive than blasr)
    *** proovread-2.00
  • much faster and more sensitive due to bwa mem as default mapper
  • increased speed and contiguity by using unitigs in correction
  • optimized for HiSeq & MiSeq reads
  • compressed BAM intermediates
  • efficient threading up to 20 cores and more
  • ...
    ** Installation


fresh install

git clone --recursive
cd proovread/util/bwa

update existing install to latest version

cd proovread/
git pull
git submodule update --recursive
cd /util/bwa
make clean # important for changes to bwa source code taking effect


NOTE: proovread comes with its own, modified version of [[radio-bwa-proovread][bwa]]. Using it
with a standard bwa built will fail.

** Dependencies

proovread is distributed ready with binaries of [[][SHRiMP2]] and [[][BLASR]]. If you want
to employ your own installed version of these mappers, have a look at the
[[radio-advanced-configuration][Advanced Configuration]] section.

** Usage
Test your installation by running proovread on the included sample data set.

proovread --sample


Don't run proovread on entire SMRT cells directly, it will only blast your
memory and take forever. Split your data in handy chunks of a few Mbp first:


located in /path/to/proovread/bin

SeqChunker -s 20M -o pb-%03d.fq pb-subreads.fq


Run proovread on one chunk first.

proovread -l pb-001.fq -s reads.fq [-u unitigs.fa] --pre pb-001


If things go smoothly, submit the rest. See [[radio-log-and-statistics][Log and statistics]] for some notes on how to
interpret logging information.

Primarily proovread has been designed to correct /PacBio subreads/. You get
these reads either from PacBio's SMRT-Portal or by using =dextract= from Gene
Myers PacBio assembler [[][DAZZLER]], which I would recommend.

In general, reads can be provided in FASTQ or FASTA format. Quality information
is used, but only has minor advantages. More valuable are subread information
given in default PacBio IDs, which if available are utilized by proovreads
=ccseq= module to improve correction performance. Reads shorter then 2x the mean
short read length will be ignored.

It is also possible to feed other types of erroneous sequences to proovread,
e.g. contigs, 454 reads, ... However, keep in mind that the alignment model for
mappings has been optimized for PacBio reads and may produce artifacts in other
scenarios. We are currently working on a version optimized for /Oxford Nanopore/

*** short-reads
For correction of long reads, proovread needs high coverage short read
data. Typically these are HiSeq (75-150bp) and MiSeq reads (200-300bp), with
overlapping libraries merged ([[][FLASh]]) for best performance. But also 454 or
PacBio CCS reads can be used.

Reads need to have FASTQ/A format and may differ in length. Pairing information
are not used. Use of quality trimmed or error corrected reads can improve

The recommended coverage for short reads data is around 30-50X and should be
specified with =--coverage=. If you have less coverage, it is definitely still
worth running proovread. However, it is likely that contiguity will suffer.

Internally, proovread will sample subsets for different iterations, by default
15X for initial runs, 30X for the finishing. For customization of these rates
see =sr-coverage= in proovread's config ([[radio-advanced-configuration][Advanced Configuration]]).

*** unitigs
In addition to short reads, [[][unitigs]] can/should be used for correction in
particular for large data sets (eukaryotes). Unitigs are high-confidence
assembly fragments produced by for example ALLPATHS, Meraculous2 or the Celera
Assembler. In contrast to contigs, unitigs don't extend past any conflict in the
underlying short read data, making them highly reliable.

There are two huge advantages of using pre-computed unitigs:
1) Contiguity: unitigs are longer then corresponding short reads, which makes
them easier to align and give better chances to also correct difficult
2) Speed: During unitig computation, all redundancy is removed from the data,
creating a minimal set which can be aligned much faster.

However, unitigs only cover regions without conflicts in short read data
space. To correct PacBio reads in full length these gaps need to be corrected
with primary short read data.

** dazzling proovread - dazz2sam
Currently, support for DAZZLER/DALIGNER is considered experimental. To use
dazzler instead of blasr, either export paths or set =daligner-path= and
=dazz-db-path= in the config and invoke with modes
=sr+dazz-utg=/=mr+dazz-utg=. In the current implementation, only a single
instance of dazzler will be invoked, therefore threading is determined by the
thread setup with which daligner has been compiled (default 4).

Since proovread is designed to operate on BAM/SAM, for the time being, daligner
output is internally converted to SAM using a simple parser script
(=dazz2sam=). This script also works as a stand-alone tool for dazzler-to-SAM
conversion (=proovread/bin/dazz2sam --help=), which might come in handy if one
wants to visualize dazzler mappings in common alignment viewers like [[][IGV]] or

** extracting unitigs from ALLPATHS


extract unitigs from allpaths assembly

allpathslg/bin/Fastb2Fasta IN=reads.unibases.k96 OUT=unitigs.fa


** Output
By default, proovread generates six files in the output folder:

| .trimmed.f[aq] | high accuracy pacbio reads, trimmed for uncorrected/low quality regions |
| .untrimmed.fq | complete corrected pacbio reads including un-/ poorly corrected regions |
| .ignored.tsv | ids of reads and the reason for excluding them from correction |
| .chim.tsv | annotations of potential chimeric joints clipped during trimming |
| .parameter.log | the parameter set used for this run |

If you are interested in mappings (BAM) and other intermediary files from
iterations have a look at =--keep-temporary=.

The phred scores produced by proovread derive from short read support of each
base during correction. The values are scaled to realistically mimic sequencing
phred accuracies:

| Phred | Accuracy | p33 |
| 40 | 99.99 | I |
| 30 | 99.90 | ? |
| 20 | 99.00 | 5 |
| 10 | 90.00 | + |

** Log and statistics

proovread generates a comprehensive log on STDERR. The includes fully functional
system calls for scripts/tools invoked by proovread. That way, if something goes
wrong, its easy to rerun a certain task individually and take a closer look on the

If you want to analyze, how things are going and whether there might be problems
with sensitivity etc., the most important information is =Masked: xx%= after
each iteration.

grep -P 'Running mode|ked :|ning task' proovread.log
[Mon Jan 26 09:52:05 2015] Running mode: blasr-utg
[Mon Jan 26 09:52:51 2015] Running task blasr-utg
[Mon Jan 26 10:00:32 2015] Masked : 55.3%
[Mon Jan 26 10:00:32 2015] Running task bwa-mr-1
[Mon Jan 26 10:21:45 2015] Masked : 76.2%
[Mon Jan 26 10:28:14 2015] Running task bwa-mr-2
[Mon Jan 26 10:37:55 2015] Masked : 92.2%
[Mon Jan 26 10:39:46 2015] Running task bwa-mr-finish
[Mon Jan 26 10:51:19 2015] Masked : 93.0%


Masked regions are regions that have already been corrected at high
confidence, minus some edge fraction, which remains unmasked in order to
serve as seeds for subsequent iterations. After the first iteration, you should
have a masking percentage > 50-75%, strongly depending on quality, type and
coverage of your data. With each iteration, this value should increase.

Prior to the final iteration, all data is unmasked and the final iteration is
run with strict settings on entirely unmasked data. The obtained percentage can
be slightly lower as in the last iteration, and is roughly equal to the amount
of read bases that will make it to high-confidence .trimmed.fq output.

Advanced Configuration

proovread comes with a comprehensive configuration, which allows tuning down to
the algorithms core parameters. A custom configuration template can be generated
with =--create-cfg=. Instructions on format etc. can be found inside the
template file.

** Hardware and Parallelization
proovread has been designed with low memory node cluster architectures in
mind. Peek memory is mainly controlled by the amount of long reads
provided. With chunks of less than 20 Mbp it easily runs on a 8 GB RAM machine.

In theory, proovread can be simply parallelized by increasing
=--threads=. However, there are single thread steps and other bottlenecks, which
at some point render it more efficient, to run e.g. 4 instances at 8 threads in
parallel to make full use of a 32 CPU machine.

FAQ / General Remarks
Why do proovread results from two identical runs differ / Is proovread deterministic?

One might expect that proovread results are deterministic - meaning reproducible
in identical form if input data is identical. This, however, is not the case in
a couple of steps:

* bwa mem mappings
bwa employs heuristics that allow for slightly different
results in repeated runs. In particular, one feature is prone to generate
differences when employed in proovread's iterative strategy: for performance
reasons bwa encodes nucleotides using 2 bits only, meaning bwa only has a
four letter alphabet =[ATGC]=. Other bases, including =NNNN= stretches used
for masking by proovread, are converted into random =[ATGC]= strings. This
has the most severe effect on alignments at the margins of masked regions:

| |||||||
aln-01 | AATTGGT
| |||||||| ||


* sorting with threshold
whenever there are decisions to make based on some sort of sorted list in
combination with fixed amount of items to keep/remove, things get
non-deterministic, if different items can have identical values in their sorting
fields. In proovread, this for example affects filtering of "best alignments" in
bins (local context).

* consensus calling
Whenever there is a 50-50 ratio for a base call, the resulting base is randomly
chosen, to minimize particular biases.

** Algorithm and Implementation
Algorithm and Implementation are described in detail in the [[][proovread]] paper.

*** :noexport:
[[][view proovread mechanism poster]]

*** bwa-proovread

proovread does local score comparison, rather than using a single hard
cut-off. bwa-proovread is modified in the same fashion. =proovread.[ch]= extend
bwa with an implementation of proovread's binning algorithm. Reporting of
alignments is determined by score-comparison within bins. That way repeat
alignments are filtered early on, increasing performance and largely reducing
disk space requirements.

** Citing proovread
If you use proovread, please cite:

[[][proovread]]: large-scale high accuracy PacBio correction through iterative short
read consensus. Hackl, T.; Hedrich, R.; Schultz, J.; Foerster, F. (2014).

Please, also recognize the authors of software packages, employed by proovread:

Exploring single-sample SNP and INDEL calling with whole-genome de novo
assembly. Li H. (2012) ([[][bwa]])

Mapping single molecule sequencing reads using basic local alignment with
successive refinement ([[][BLASR]]): application and theory. Mark J Chaisson; Glenn
Tesler. (2012)

[[][SHRiMP]]: Accurate Mapping of Short Color-space Reads. Stephen M Rumble; Phil
Lacroute; Adrian V. Dalca; Marc Fiume; Arend Sidow; Michael Brudno. (2009)

** Contact
If you have any questions, encounter problems or potential bugs, don't hesitate
to contact us. Either report [[][issues]] on github or write an email to:

#+TITLE: proovread manual

#+AUTHOR: Thomas Hackl



#+OPTIONS: ^:nil date:nil H:2 todo:nil

#+LaTeX_CLASS: scrartcl

#+LaTeX_CLASS_OPTIONS: [a4paper,12pt,headings=small]

#+LaTeX_HEADER: \setlength{\parindent}{0pt}

#+LaTeX_HEADER: \setlength{\parskip}{1.5ex}

#+LATEX_HEADER: \renewcommand{\familydefault}{\sfdefault}

Docker Pull Command
Source Repository