- revisited unitig mode with improved filter for repeat regions
- dazzling proovread - experimental support for [[https://dazzlerblog.wordpress.com/2014/12/31/damapper-and-other-dazzler-upgrades/][DALIGNER]] as unitig mapper (faster/more sensitive than blasr)
- much faster and more sensitive due to bwa mem as default mapper
- increased speed and contiguity by using unitigs in correction
- optimized for HiSeq & MiSeq reads
- compressed BAM intermediates
- efficient threading up to 20 cores and more
git clone --recursive https://github.com/BioInf-Wuerzburg/proovread
update existing install to latest version
git submodule update --recursive
make clean # important for changes to bwa source code taking effect
NOTE: proovread comes with its own, modified version of [[radio-bwa-proovread][bwa]]. Using it
with a standard bwa built will fail.
- Perl 5.10 or later
- [[ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast%2B/LATEST/][NCBI Blast-2.2.24+]] or later
- [[http://sourceforge.net/projects/samtools/files/samtools/][samtools-1.1]] or later
proovread is distributed ready with binaries of [[http://compbio.cs.toronto.edu/shrimp/shrimp][SHRiMP2]] and [[https://github.com/PacificBiosciences/blasr][BLASR]]. If you want
to employ your own installed version of these mappers, have a look at the
[[radio-advanced-configuration][Advanced Configuration]] section.
Test your installation by running proovread on the included sample data set.
Don't run proovread on entire SMRT cells directly, it will only blast your
memory and take forever. Split your data in handy chunks of a few Mbp first:
located in /path/to/proovread/bin
SeqChunker -s 20M -o pb-%03d.fq pb-subreads.fq
Run proovread on one chunk first.
proovread -l pb-001.fq -s reads.fq [-u unitigs.fa] --pre pb-001
If things go smoothly, submit the rest. See [[radio-log-and-statistics][Log and statistics]] for some notes on how to
interpret logging information.
Primarily proovread has been designed to correct /PacBio subreads/. You get
these reads either from PacBio's SMRT-Portal or by using =dextract= from Gene
Myers PacBio assembler [[http://dazzlerblog.wordpress.com/2014/03/22/the-dextractor-module-save-disk-space-for-your-pacbio-projects/][DAZZLER]], which I would recommend.
In general, reads can be provided in FASTQ or FASTA format. Quality information
is used, but only has minor advantages. More valuable are subread information
given in default PacBio IDs, which if available are utilized by proovreads
=ccseq= module to improve correction performance. Reads shorter then 2x the mean
short read length will be ignored.
It is also possible to feed other types of erroneous sequences to proovread,
e.g. contigs, 454 reads, ... However, keep in mind that the alignment model for
mappings has been optimized for PacBio reads and may produce artifacts in other
scenarios. We are currently working on a version optimized for /Oxford Nanopore/
For correction of long reads, proovread needs high coverage short read
data. Typically these are HiSeq (75-150bp) and MiSeq reads (200-300bp), with
overlapping libraries merged ([[http://ccb.jhu.edu/software/FLASH/][FLASh]]) for best performance. But also 454 or
PacBio CCS reads can be used.
Reads need to have FASTQ/A format and may differ in length. Pairing information
are not used. Use of quality trimmed or error corrected reads can improve
The recommended coverage for short reads data is around 30-50X and should be
specified with =--coverage=. If you have less coverage, it is definitely still
worth running proovread. However, it is likely that contiguity will suffer.
Internally, proovread will sample subsets for different iterations, by default
15X for initial runs, 30X for the finishing. For customization of these rates
see =sr-coverage= in proovread's config ([[radio-advanced-configuration][Advanced Configuration]]).
In addition to short reads, [[http://wgs-assembler.sourceforge.net/wiki/index.php/Celera_Assembler_Terminology][unitigs]] can/should be used for correction in
particular for large data sets (eukaryotes). Unitigs are high-confidence
assembly fragments produced by for example ALLPATHS, Meraculous2 or the Celera
Assembler. In contrast to contigs, unitigs don't extend past any conflict in the
underlying short read data, making them highly reliable.
There are two huge advantages of using pre-computed unitigs:
1) Contiguity: unitigs are longer then corresponding short reads, which makes
them easier to align and give better chances to also correct difficult
2) Speed: During unitig computation, all redundancy is removed from the data,
creating a minimal set which can be aligned much faster.
However, unitigs only cover regions without conflicts in short read data
space. To correct PacBio reads in full length these gaps need to be corrected
with primary short read data.
** dazzling proovread - dazz2sam
Currently, support for DAZZLER/DALIGNER is considered experimental. To use
dazzler instead of blasr, either export paths or set =daligner-path= and
=dazz-db-path= in the config and invoke with modes
=sr+dazz-utg=/=mr+dazz-utg=. In the current implementation, only a single
instance of dazzler will be invoked, therefore threading is determined by the
thread setup with which daligner has been compiled (default 4).
Since proovread is designed to operate on BAM/SAM, for the time being, daligner
output is internally converted to SAM using a simple parser script
(=dazz2sam=). This script also works as a stand-alone tool for dazzler-to-SAM
conversion (=proovread/bin/dazz2sam --help=), which might come in handy if one
wants to visualize dazzler mappings in common alignment viewers like [[http://www.broadinstitute.org/igv/][IGV]] or
** extracting unitigs from ALLPATHS
extract unitigs from allpaths assembly
allpathslg/bin/Fastb2Fasta IN=reads.unibases.k96 OUT=unitigs.fa
By default, proovread generates six files in the output folder:
| .trimmed.f[aq] | high accuracy pacbio reads, trimmed for uncorrected/low quality regions |
| .untrimmed.fq | complete corrected pacbio reads including un-/ poorly corrected regions |
| .ignored.tsv | ids of reads and the reason for excluding them from correction |
| .chim.tsv | annotations of potential chimeric joints clipped during trimming |
| .parameter.log | the parameter set used for this run |
If you are interested in mappings (BAM) and other intermediary files from
iterations have a look at =--keep-temporary=.
The phred scores produced by proovread derive from short read support of each
base during correction. The values are scaled to realistically mimic sequencing
| Phred | Accuracy | p33 |
| 40 | 99.99 | I |
| 30 | 99.90 | ? |
| 20 | 99.00 | 5 |
| 10 | 90.00 | + |
** Log and statistics
proovread generates a comprehensive log on STDERR. The includes fully functional
system calls for scripts/tools invoked by proovread. That way, if something goes
wrong, its easy to rerun a certain task individually and take a closer look on the
If you want to analyze, how things are going and whether there might be problems
with sensitivity etc., the most important information is =Masked: xx%= after
grep -P 'Running mode|ked :|ning task' proovread.log
[Mon Jan 26 09:52:05 2015] Running mode: blasr-utg
[Mon Jan 26 09:52:51 2015] Running task blasr-utg
[Mon Jan 26 10:00:32 2015] Masked : 55.3%
[Mon Jan 26 10:00:32 2015] Running task bwa-mr-1
[Mon Jan 26 10:21:45 2015] Masked : 76.2%
[Mon Jan 26 10:28:14 2015] Running task bwa-mr-2
[Mon Jan 26 10:37:55 2015] Masked : 92.2%
[Mon Jan 26 10:39:46 2015] Running task bwa-mr-finish
[Mon Jan 26 10:51:19 2015] Masked : 93.0%
Masked regions are regions that have already been corrected at high
confidence, minus some edge fraction, which remains unmasked in order to
serve as seeds for subsequent iterations. After the first iteration, you should
have a masking percentage > 50-75%, strongly depending on quality, type and
coverage of your data. With each iteration, this value should increase.
Prior to the final iteration, all data is unmasked and the final iteration is
run with strict settings on entirely unmasked data. The obtained percentage can
be slightly lower as in the last iteration, and is roughly equal to the amount
of read bases that will make it to high-confidence .trimmed.fq output.
proovread comes with a comprehensive configuration, which allows tuning down to
the algorithms core parameters. A custom configuration template can be generated
with =--create-cfg=. Instructions on format etc. can be found inside the
** Hardware and Parallelization
proovread has been designed with low memory node cluster architectures in
mind. Peek memory is mainly controlled by the amount of long reads
provided. With chunks of less than 20 Mbp it easily runs on a 8 GB RAM machine.
In theory, proovread can be simply parallelized by increasing
=--threads=. However, there are single thread steps and other bottlenecks, which
at some point render it more efficient, to run e.g. 4 instances at 8 threads in
parallel to make full use of a 32 CPU machine.
FAQ / General Remarks
** Why do proovread results from two identical runs differ / Is proovread deterministic?
One might expect that proovread results are deterministic - meaning reproducible
in identical form if input data is identical. This, however, is not the case in
a couple of steps:
* bwa mem mappings
bwa employs heuristics that allow for slightly different
results in repeated runs. In particular, one feature is prone to generate
differences when employed in proovread's iterative strategy: for performance
reasons bwa encodes nucleotides using 2 bits only, meaning bwa only has a
four letter alphabet =[ATGC]=. Other bases, including =NNNN= stretches used
for masking by proovread, are converted into random =[ATGC]= strings. This
has the most severe effect on alignments at the margins of masked regions:
orig | ATGAATTGGTTAATCTGC
masked | ATGAATTGGTNNNNNNNN
read | AATTGGTTAAT
rand-01 | ATGAATTGGTAGCCATGG
aln-01 | AATTGGT
rand-02 | ATGAATTGGTTTATCTGC
| |||||||| ||
aln-02 | AATTGGTTAAT
* sorting with threshold
whenever there are decisions to make based on some sort of sorted list in
combination with fixed amount of items to keep/remove, things get
non-deterministic, if different items can have identical values in their sorting
fields. In proovread, this for example affects filtering of "best alignments" in
bins (local context).
* consensus calling
Whenever there is a 50-50 ratio for a base call, the resulting base is randomly
chosen, to minimize particular biases.
** Algorithm and Implementation
Algorithm and Implementation are described in detail in the [[http://dx.doi.org/10.1093/bioinformatics/btu392][proovread]] paper.
[[https://github.com/BioInf-Wuerzburg/proovread/blob/master/media/proovread-poster.pdf][view proovread mechanism poster]]
proovread does local score comparison, rather than using a single hard
cut-off. bwa-proovread is modified in the same fashion. =proovread.[ch]= extend
bwa with an implementation of proovread's binning algorithm. Reporting of
alignments is determined by score-comparison within bins. That way repeat
alignments are filtered early on, increasing performance and largely reducing
disk space requirements.
** Citing proovread
If you use proovread, please cite:
[[http://dx.doi.org/10.1093/bioinformatics/btu392][proovread]]: large-scale high accuracy PacBio correction through iterative short
read consensus. Hackl, T.; Hedrich, R.; Schultz, J.; Foerster, F. (2014).
Please, also recognize the authors of software packages, employed by proovread:
Exploring single-sample SNP and INDEL calling with whole-genome de novo
assembly. Li H. (2012) ([[http://dx.doi.org/10.1093/bioinformatics/bts280][bwa]])
Mapping single molecule sequencing reads using basic local alignment with
successive refinement ([[http://dx.doi.org/10.1186/1471-2105-13-238][BLASR]]): application and theory. Mark J Chaisson; Glenn
[[http://dx.doi.org/10.1371/journal.pcbi.1000386][SHRiMP]]: Accurate Mapping of Short Color-space Reads. Stephen M Rumble; Phil
Lacroute; Adrian V. Dalca; Marc Fiume; Arend Sidow; Michael Brudno. (2009)
If you have any questions, encounter problems or potential bugs, don't hesitate
to contact us. Either report [[https://github.com/BioInf-Wuerzburg/proovread/issues][issues]] on github or write an email to:
- Thomas Hackl - firstname.lastname@example.org
- Frank Foerster - email@example.com
#+TITLE: proovread manual
#+AUTHOR: Thomas Hackl
#+OPTIONS: ^:nil date:nil H:2 todo:nil