Welcome to DECARD (Detailed Evaluation Creation and Analysis of Read Data).
What is DECARD?

DECARD can be used to simulate amplicon-based microbiome experiments, and test how classification software performs. DECARD can help with each step of the process:

Downloading reference sequences from NCBI and SILVA.
Curating the reference sequences by genus and species.
Performing quality control on the reference sequences.
Generating targets for a set of communities.
Performing in-silico PCR to generate amplicon sequences.
(Optional) Performing in-silico sequencing of the amplicons, with error addition, using a tool like ART.

These amplicons or sequences can then be classified. The resultant classifications can be evaluated (1) for the accuracy of the OTU generation step, and (2) for the resolution and accuracy of the classification of each source organism.

Version 0.3

1. Install Docker on your computer.

2. Grab the decard container:

docker pull golob/decard

3. Run the container.

docker run -it golob/decard

4. Change to our home directory.

cd ~

5. Make a directory structure to support our project.

mkdir -p projects/tutorial
mkdir projects/tutorial/inputs
mkdir projects/tutorial/outputs
mkdir projects/tutorial/analysis
mkdir projects/tutorial/working

(Note: these directories will not be saved when you quit the container. Docker also lets you mount a real directory from your computer's file system onto the container's /root/projects directory, to save your results; for example: docker run -it -v /path/on/your/computer:/root/projects golob/decard)

6. (Optional) Update the NCBI 16S Microbial Reference.

references/NCBI_16S_microbial.fasta -e <your@email>

Replace <your@email> with a valid email address (this is per a request from NCBI).
The container ships with a cached copy of the NCBI 16S Microbial database as of February 2017.
Updates are incremental: only new references are downloaded, so this typically takes just a moment.

7. Define the Community Composition

We need to define the community we wish to create. To do so, we specify, for each genus we wish to include, the mean and standard deviation (SD) of its fractional abundance, as well as some parameters that determine how many specific organisms to select from that genus.

To determine how many organisms to include from a given genus, our CSV file can specify a log model,

a * log(f) + b

where a and b are parameters and f is the fractional abundance of the genus, or a simpler mean and standard deviation of the number of species to include. DECARD enforces a minimum of one organism per genus represented in the community.

(Why are there two ways of defining how many organisms to include per genus? Sometimes it is hard to fit the log model to real data, or the model doesn't fit particularly well. The mean-and-standard-deviation method is a back-up for those cases.)
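
To make this concrete, here is a minimal Python sketch of the selection logic; the function name and the example parameter values are hypothetical, and DECARD's actual implementation may differ:

import math
import random

def n_organisms(fraction, slope=None, intercept=None, n_mean=None, n_std=None):
    # Hypothetical sketch: decide how many organisms to draw for one genus.
    # Prefer the log model slope * log(f) + intercept; fall back to a
    # normally distributed count when no model is supplied.
    if slope is not None and intercept is not None:
        n = slope * math.log(fraction) + intercept
    else:
        n = random.gauss(n_mean, n_std)
    # DECARD enforces a minimum of one organism per represented genus
    return max(1, int(round(n)))

# e.g. a genus at 35% fractional abundance, with made-up model parameters
print(n_organisms(0.35, slope=2.1, intercept=5.0))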

We define the community composition we're targeting with a CSV file, with the following columns:

Genus: Capitalized plain-text genus name. (Bacteroides)
TaxID: NCBI taxonomy ID for this genus. (816)
Fraction: Fractional (or weighting) abundance, 0.0 - 1.0. DECARD normalizes the sum of these across a community to 1.0, so you don't have to do so manually. (0.35)
STD: Standard deviation of the fractional abundance of this genus. (0.05)

Species_slope: The slope a in the model a * log(f) + b.
Species_intercept: The intercept b in the model a * log(f) + b.
Species_p: The p-value of the fit of the log model.
Species_n: Number of organisms to include for this genus (when not using the model).
Species_std: Standard deviation of the number of organisms for this genus.
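
Putting the example values above into a single row, such a file might look like this (the model-parameter values are made up for illustration, and the column order simply follows the list above):

Genus,TaxID,Fraction,STD,Species_slope,Species_intercept,Species_p,Species_n,Species_std
Bacteroides,816,0.35,0.05,2.1,5.0,0.01,4,1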

A complete example file for this demo is available in the decard_data directory. You can put it in the proper place with the command:

cp decard_data/Community.csv projects/tutorial/

8. Recruit sequences for each genus

First, let's make a directory to store our recruited sequences:

mkdir references/byGenus

Now we can recruit sequences from our NCBI reference for each genus we're adding to our community:

-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Bacteroides
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Faecalibacterium
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Alistipes
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Blautia
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Parabacteroides
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Eubacterium
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Prevotella
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Escherichia
-d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Streptococcus

9. Quality Control Sequences

Again, this needs to be done for each genus required for our community. This step makes sure there are no ambiguous bases and that the sequences meet a minimum length, and it throws out sequences that are inconsistent (named the same, but differing from the majority).
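
Conceptually, the per-sequence checks amount to something like this minimal Python sketch (the length threshold is an assumption, not DECARD's actual value, and the majority-consistency check is omitted):

def passes_qc(seq, min_len=1200):
    # Reject sequences that are too short or contain ambiguous bases
    return len(seq) >= min_len and set(seq.upper()) <= set("ACGT")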

This can be a slow and computationally intensive process.

references/byGenus/Bacteroides/
references/byGenus/Faecalibacterium/
references/byGenus/Alistipes/
references/byGenus/Blautia/
references/byGenus/Parabacteroides/
references/byGenus/Eubacterium/
references/byGenus/Prevotella/
references/byGenus/Escherichia/
references/byGenus/Streptococcus/

10. Generate the synthetic Communities
We will now use our model and QC'ed 16S sequences to make some communities. This happens in two steps. In the first step, we generate a target file that lists specific 16S SSU accessions and their weights in the final community. In the second step, we use this file to perform in-silico PCR and actually create read files to classify. Splitting these steps lets us try different PCR primers on the same synthetic community.

10a. Generate Targets
Here we will create 10 communities based on the community description we created above.

-g references/byGenus/ \
-d projects/tutorial/Community.csv \
-n 10 \
-p Tut01 \
-o projects/tutorial/targets.csv

10b. Generate Sequences

Now we can perform in-silico PCR and generate read files. For this, we will need PCR primer sequences; for this example we will use those of the EMP. We will also need to specify a goal number of reads per community.

--targets projects/tutorial/targets.csv \
--rRNA16SSU references/byGenus/ \
--primers decard_data/primers/Illumina_EMP_U515f_806R.fasta \
--count 5000 \
--map projects/tutorial/inputs/ \
--output projects/tutorial/inputs/tut01.emp5k.fasta

This will take a moment to run.

The map file will have a generated sequence ID, community ID, organism name, NCBI taxon ID, and source accession. (Because sequences are picked randomly, yours may not exactly match mine.) This file is the true mapping of amplicon to source organism, which can be used later to assess how well a given classifier performed.

Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1

The FASTA file will have the amplicons themselves.

>Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
>Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
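
Since the map file is plain CSV, tallying amplicons per source organism is straightforward; a Python sketch (the map filename here is assumed for illustration):

import csv
from collections import Counter

# Count amplicons per (community, organism); columns follow the example
# above: sequence ID, community ID, organism name, taxon ID, accession.
counts = Counter()
with open("projects/tutorial/inputs/tut01.emp5k.map.csv") as f:  # name assumed
    for seq_id, community, organism, tax_id, accession in csv.reader(f):
        counts[(community, organism)] += 1

for (community, organism), n in counts.most_common(5):
    print(community, organism, n)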

11. Generate a sequence-to-community info file

projects/tutorial/inputs/

This produces the seq_info file (projects/tutorial/inputs/tut01.emp5k.seq_info.csv) used by the analysis steps below.

12. Classify using these amplicon sequences
This step depends on the classifier and the settings used. One requirement: the results must be completely re-duplicated, with each and every amplicon sequence given an assigned OTU and classification.

We've written modules that can read in the output from QIIME and MOTHUR. To help with this tutorial, we've installed QIIME and the GreenGenes reference set for you.

Here's a brief set of steps to classify our reads with GreenGenes 97% OTU and closed OTU generation:

i) Make a fake mapping file:
This is to help with getting classification output for each sequence.

mkdir projects/tutorial/working/qiime
echo -e "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\ntut01\t\t\twithouterror" \
> projects/tutorial/working/qiime/map.txt
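
For reference, the generated map.txt will contain (tab-separated):

#SampleID	BarcodeSequence	LinkerPrimerSequence	Description
tut01			withouterror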

ii) Preprocess with QIIME's split_libraries.py:

split_libraries.py \
-m projects/tutorial/working/qiime/map.txt \
-f projects/tutorial/inputs/tut01.emp5k.fasta \
-o projects/tutorial/working/qiime -a 0 -b 0 -p

iii) Allow for per-sequence output (DECARD helper module)

projects/tutorial/working/qiime/seqs.fna

This produces the post.fna file used in the next step.

iv) Classify with closed OTU generation against the GreenGenes 97% OTU reference, using QIIME's pick_closed_reference_otus.py:

pick_closed_reference_otus.py \
-i projects/tutorial/working/qiime/post.fna \
-o projects/tutorial/working/qiime/gg97closed \
-r references/gg_13_8_otus/rep_set/97_otus.fasta \
-t references/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt \
-s -f

v) Generate a DECARD-style OTU table from the BIOM-format output

projects/tutorial/working/qiime/gg97closed/otu_table.biom \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
-s projects/tutorial/inputs/tut01.emp5k.seq_info.csv

Please replace the email parameter with your email address.

vi) Generate a pruned phylogenetic tree

First convert the BIOM table to TSV, then prune the GreenGenes tree down to the observed OTUs (here with QIIME's filter_tree.py):

biom convert -i projects/tutorial/working/qiime/gg97closed/otu_table.biom \
-o projects/tutorial/working/qiime/gg97closed/otu_table.txt \
--to-tsv --header-key taxonomy

filter_tree.py -i references/gg_13_8_otus/trees/97_otus.tree \
-t projects/tutorial/working/qiime/gg97closed/otu_table.txt \
-o projects/tutorial/working/qiime/gg97closed/pruned.tre

13. Test OTU generation
We can use our classifier output (converted into our OTU table format) to test OTU generation fidelity.
To do so, for each pair of sequences we determine whether they came from the same organism, and if so, whether they were correctly paired into the same OTU. Likewise, for sequences from different source organisms, we ask whether they were appropriately split by the classifier.
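
The pair-counting logic is conceptually like this minimal Python sketch (names hypothetical; DECARD's actual bookkeeping may differ):

from itertools import combinations

def otu_pair_counts(true_src, otu_of):
    # true_src: seq_id -> source organism (from the map file)
    # otu_of:   seq_id -> OTU assigned by the classifier
    tp = fn = tn = fp = 0
    for a, b in combinations(true_src, 2):
        same_src = true_src[a] == true_src[b]
        same_otu = otu_of.get(a) == otu_of.get(b)
        if same_src and same_otu:
            tp += 1  # correctly lumped together
        elif same_src:
            fn += 1  # wrongly split apart
        elif not same_otu:
            tn += 1  # correctly split apart
        else:
            fp += 1  # wrongly lumped together
    return tp, fn, tn, fp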

mkdir projects/tutorial/analysis/otu

projects/tutorial/inputs/ \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv

The output is in a CSV file.

(Your output will differ from this, as the communities are randomly generated.)

14. Test Classification
Next up, testing classification!

mkdir projects/tutorial/analysis/classification

projects/tutorial/inputs/ \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
projects/tutorial/analysis/classification/qiime_gg97closed \
-s

Output comes in a few files. A JSON file is created with enough detail for interactive visualization. CSV files are provided for species, genus, and no targeted rank of classification. Each of these CSV files reports the number of source organisms correctly classified, undercalled (in the correct clade, but not to the target resolution), overcalled (not relevant in this revision), miscalled sibling (correct parent, but wrong final rank), miscalled (down the wrong clade), and dropped.
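
To make the categories concrete, here is a hypothetical Python sketch of how one call might be bucketed, treating lineages as root-to-tip lists of taxa (an illustration, not DECARD's actual code):

def call_category(true_lin, called_lin, rank):
    # rank is the index of the targeted rank (e.g. species) in the lineage
    if not called_lin:
        return "dropped"
    if len(called_lin) <= rank:
        if called_lin == true_lin[:len(called_lin)]:
            return "undercalled"       # correct clade, short of target rank
        return "miscalled"             # down the wrong clade
    if called_lin[rank] == true_lin[rank]:
        return "correct"
    if called_lin[:rank] == true_lin[:rank]:
        return "miscalled sibling"     # correct parent, wrong final rank
    return "miscalled"

true = ["Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
        "Bacteroidaceae", "Bacteroides", "Bacteroides caecigallinarum"]
print(call_category(true, true[:6], rank=6))  # undercalled: stopped at genus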


15. Diversity Fidelity
We can now compare how well our classifier output matches the true diversity of the community, and create a spiffy figure along the way too!

mkdir projects/tutorial/analysis/diversity/

projects/tutorial/targets.csv \
--rRNA16SSU references/byGenus/ \
--seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
--statsfile projects/tutorial/analysis/diversity/diversity.xlsx \
--otutables projects/tutorial/outputs/qiime_gg97closed.otutable.csv

projects/tutorial/analysis/diversity/diversity.xlsx \
projects/tutorial/analysis/diversity/diversity.pdf

The first command takes multiple OTU-table outputs (to compare different pipelines or settings), calculates the Shannon diversity of each community from each output, and calculates the Shannon diversity of the true population; the result is an Excel file. The second command takes this Excel file and plots the true versus estimated Shannon alpha-diversity for each pipeline as a 2-D plot. It additionally calculates a Spearman R^2 to test for monotonicity.
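
The statistic being compared is the standard Shannon index; for illustration, a minimal Python version:

import math

def shannon(counts):
    # H' = -sum(p_i * ln(p_i)) over the taxa observed in one community
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total) for c in counts if c)

# e.g. a perfectly even community of four organisms
print(shannon([25, 25, 25, 25]))  # ln(4), about 1.386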

16. Pairwise-distance (beta-diversity) testing

First, we need to create a 'true' phylogenetic tree from the full-length 16S sequences used to build our communities:

mkdir projects/tutorial/tree

projects/tutorial/targets.csv \
--output_dir projects/tutorial/tree/ \
--email <your@email> \
--cm /usr/local/lib/python2.7/dist-packages/deenurp-0.1.8.dev59-py2.7.egg/deenurp/data/ \
--threads 1

cp projects/tutorial/tree/ projects/tutorial/targets.tre

Now we can proceed with comparing the true pairwise distances between the communities we generated, and the distances estimated by the classifiers.

To help with this, let's create a configuration file:

echo -e "name,otutable,tree\nQIIME gg97 closed,projects/tutorial/outputs/qiime_gg97closed.otutable.csv,projects/tutorial/working/qiime/gg97closed/pruned.tre" > projects/tutorial/analysis/distance_config.csv \
projects/tutorial/targets.csv projects/tutorial/targets.tre \
--rRNA16SSU references/byGenus/  \
--seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
--figure projects/tutorial/analysis/distance.pdf \
--distfile projects/tutorial/analysis/distance.xlsx \
--dpcoa \
--config projects/tutorial/analysis/distance_config.csv

Two outputs are produced: an Excel file (with many subsheets) and a PDF in which each distance method gets its own page with a 2-D plot of true pairwise distance versus the pairwise distance estimated from the classifier's output by that method.
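
As a concrete (non-phylogenetic) example of the kind of pairwise distance being compared, here is a Bray-Curtis dissimilarity in a few lines of Python; the methods DECARD actually uses (e.g. the DPCoA enabled by --dpcoa above) are more involved:

def bray_curtis(a, b):
    # a, b: abundance vectors over the same taxa, one per community
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

print(bray_curtis([10, 0, 5], [6, 4, 5]))  # 0.0 would mean identical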
