Welcome to DECARD (Detailed Evaluation Creation and Analysis of Read Data).
What is DECARD?
DECARD can be used to simulate amplicon-based microbiome experiments, and test how classification software performs. DECARD can help with each step of the process:
- Downloading reference sequences from NCBI (NCBI_16SMicrobial_update.py) and SILVA (SILVA_selectFromGenera.py)
- Curating the reference sequences by genus and species (selectFromGenera.py)
- Performing quality control on the references (genera_qc.py)
- Generating targets for a set of communities (generate_targets.py)
- Performing in-silico PCR and generating amplicon sequences (generate_sequences.py)
- (Optional) Performing in-silico sequencing of the amplicons, with error addition, using a tool such as ART
These amplicons or sequences can then be classified. The resultant classifications can be evaluated:
1. For the accuracy of the OTU generation step (test_otu.py)
2. For the resolution and accuracy of the classification of each source organism (test_classification.py)
1. Install docker on your computer.
2. Grab the decard container:
docker pull golob/decard
3. Run the container.
docker run -it golob/decard
4. Change to our home directory.
cd ~
5. Make a directory structure to support our project.
mkdir projects/tutorial
mkdir projects/tutorial/inputs
mkdir projects/tutorial/outputs
mkdir projects/tutorial/analysis
mkdir projects/tutorial/working
(Note: these directories will not be saved when you quit the container. Docker also lets you mount a real directory from your computer's file system onto the container's /root/projects directory to save your results.)
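For example, you could start the container with a mounted host directory instead (the host directory name here is illustrative; substitute your own path):

```shell
# Mount ./decard_projects on the host as /root/projects inside the container,
# so everything written under projects/ survives after the container exits.
mkdir -p decard_projects
docker run -it -v "$(pwd)/decard_projects":/root/projects golob/decard
```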
6. Update the NCBI 16S Microbial Reference.
NCBI_16SMicrobial_update.py references/NCBI_16S_microbial.fasta -e <your@email>
Replace <your@email> with a valid email address (this is as per a request from NCBI).
The container ships with a cached copy of the NCBI 16S Microbial database as of February 2017. Updates download only the new records, so this step typically takes only a moment.
7. Define the Community Composition
We need to define the community we wish to create. To do so, we need to specify for each genus we wish to include in our community the mean and standard deviation (SD) fractional abundance, as well as some parameters to determine how many specific organisms to select from this genus.
To determine how many organisms to include from a given genus, our CSV file can specify a log-model
a * log(f) + b
where a and b are parameters and f is the fractional abundance of this genus, or a simpler mean and standard deviation of the number of species to include. DECARD enforces a minimum of one organism per genus represented in the community.
(Why are there two ways of defining how many organisms to include per genus? Sometimes it can be hard to fit the log model to real data, or the model doesn't fit particularly well. The mean and standard deviation method is a backup for this case.)
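As an illustration of the floor-at-one behavior of the log model (the slope, intercept, and abundance values below are hypothetical, not fitted DECARD parameters):

```python
import math

def species_count_log(f, slope, intercept):
    """Number of organisms to draw from a genus at fractional abundance f,
    using the log model slope * log(f) + intercept, floored at 1."""
    return max(1, round(slope * math.log(f) + intercept))

# Hypothetical parameters: an abundant genus contributes several organisms,
# while a rare genus still contributes at least one.
print(species_count_log(0.35, 3.0, 8.0))
print(species_count_log(0.001, 3.0, 8.0))
```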
We define the community composition we're targeting with a CSV file, with the following columns:
Genus: Capitalized plain text genus name. (Bacteroides)
TaxID: NCBI taxonomy ID for this genus. (816)
Fraction: Fractional (or weighting) abundance, 0.0 - 1.0. DECARD will normalize the sum of all of these for a community to 1.0 (so you don't have to do this manually). (0.35)
STD: Standard deviation for the fractional abundance of this genus. (0.05)
Species_slope: Slope for the model Species_slope * log(f) + Species_intercept
Species_intercept: Intercept for the same model.
Species_p: The p-value for the fit of the log model.
Species_n: Number of organisms to include for this genus
Species_std: Standard deviation for number of organisms for this genus
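The normalization described for the Fraction column can be sketched as follows (a minimal illustration, not DECARD's actual code; the weights are hypothetical):

```python
def normalize_fractions(weights):
    """Scale a genus -> weight mapping so the values sum to 1.0,
    as DECARD does with the Fraction column of the community CSV."""
    total = sum(weights.values())
    return {genus: w / total for genus, w in weights.items()}

# Hypothetical weights; note they need not sum to 1 in the CSV.
community = {"Bacteroides": 0.35, "Prevotella": 0.20, "Escherichia": 0.10}
normalized = normalize_fractions(community)
print(normalized)
```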
The example for the demo:
This file is available in the decard_data directory. You can put it in the proper place with the command:
cp decard_data/Community.csv projects/tutorial/
8. Recruit sequences for each genus
First let's make a directory to store our recruited sequences
Now we can recruit sequences from our NCBI reference for each genus we're adding to our community
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Bacteroides
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Faecalibacterium
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Alistipes
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Blautia
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Parabacteroides
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Eubacterium
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Prevotella
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Escherichia
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Streptococcus
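Since the same command is repeated with only the genus changing, an equivalent shell loop is less error-prone to type:

```shell
# Recruit reference sequences for each genus in the community.
for g in Bacteroides Faecalibacterium Alistipes Blautia Parabacteroides \
         Eubacterium Prevotella Escherichia Streptococcus; do
    selectFromGenera.py -d references/byGenus/ \
        -r references/NCBI_16S_microbial.fasta -g "$g"
done
```

The same loop pattern works for the per-genus quality-control step below.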
9. Quality Control Sequences
Again, this needs to be done for each genus required for our community. This step makes sure there are no ambiguous bases, that the sequences meet a minimum length, and that inconsistent sequences (same name, but differing from the majority) are thrown out.
This can be a slow and computationally intensive process.
genera_qc.py references/byGenus/Bacteroides/
genera_qc.py references/byGenus/Faecalibacterium/
genera_qc.py references/byGenus/Alistipes/
genera_qc.py references/byGenus/Blautia/
genera_qc.py references/byGenus/Parabacteroides/
genera_qc.py references/byGenus/Eubacterium/
genera_qc.py references/byGenus/Prevotella/
genera_qc.py references/byGenus/Escherichia/
genera_qc.py references/byGenus/Streptococcus/
10. Generate the synthetic Communities
We will now use our model and QC'ed 16S sequences to make some communities. This is in two steps. In the first step we generate a target file that lists specific 16S SSU accessions and their weight in the final community. We can then use this file in the next step to perform in-silico PCR and actually create read files to classify. By splitting these steps, we can try different PCR primers on the same synthetic community.
10a. Generate Targets
Here we will create 10 communities based on the community description we created above.
generate_targets.py \
    -g references/byGenus/ \
    -d projects/tutorial/Community.csv \
    -n 10 \
    -p Tut01 \
    -o projects/tutorial/targets.csv
10b. Generate Sequences
Now we can perform in-silico PCR and generate read files. For this, we will need PCR primer sequences; in this example we will use those of the Earth Microbiome Project (EMP). We will also need to specify a goal number of reads per community.
generate_sequences.py \
    --targets projects/tutorial/targets.csv \
    --rRNA16SSU references/byGenus/ \
    --primers decard_data/primers/Illumina_EMP_U515f_806R.fasta \
    --count 5000 \
    --map projects/tutorial/inputs/tut01.emp5k.map.csv \
    --output projects/tutorial/inputs/tut01.emp5k.fasta
This will take a moment to run.
The map file will have a generated sequence ID, community ID, organism name, NCBI taxon ID, and source accession. (Because sequences are picked randomly, yours may not exactly match mine.) This file is the true mapping of amplicon to source organism, used later to assess how well a given classifier performed.
seqID,community,organism,ncbi_tax_id,sourceSeq
Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
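As a sketch of how this map can be consumed downstream, the true composition per community is a simple tally over its rows (using the two example rows above as inline data):

```python
import csv
import io
from collections import Counter

# The two example rows in the map format shown above.
map_csv = """seqID,community,organism,ncbi_tax_id,sourceSeq
Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
"""

# Count amplicons per (community, organism): the true community composition.
counts = Counter(
    (row["community"], row["organism"])
    for row in csv.DictReader(io.StringIO(map_csv))
)
print(counts)
```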
The FASTA file will have the amplicons themselves.
>Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
GTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAA
GGGAGCGTAGACGGGATGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCTTAAAATTGC
AGTTGAAACTGGCATTCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGA
AATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGAGCGTAACTGAC
GTTGATGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCC
>Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
GTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAA
GGGAGCGTAGACGGGATGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCTTAAAATTGC
AGTTGAAACTGGCATTCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGA
11. Generate a Sequence to community info file
map_to_seq_info.py \
    projects/tutorial/inputs/tut01.emp5k.map.csv \
    projects/tutorial/inputs/tut01.emp5k.seq_info.csv
12. Classify using these amplicon sequences
This step depends on the classifier and the settings used. DECARD requires completely re-duplicated results, in which each and every amplicon sequence has an assigned OTU and classification.
We've written modules that can read in the output from QIIME and MOTHUR. To help with this tutorial, we've installed QIIME and the greengenes reference set for you.
Here's a brief set of steps to classify our reads against the GreenGenes 97% OTU reference with closed-reference OTU generation:
i) Make a fake mapping file:
This is to help with getting classification output for each sequence.
mkdir projects/tutorial/working/qiime
echo -e "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\ntut01\t\t\twithouterror" \
    > projects/tutorial/working/qiime/map.txt
ii) Relabel the reads with QIIME's split_libraries.py
split_libraries.py \
    -m projects/tutorial/working/qiime/map.txt \
    -f projects/tutorial/inputs/tut01.emp5k.fasta \
    -o projects/tutorial/working/qiime -a 0 -b 0 -p
iii) Allow for per-sequence output (DECARD helper module)
QIIME_preprocess_bySeq.py \ projects/tutorial/working/qiime/seqs.fna \ projects/tutorial/working/qiime/post.fna
iv) Classify with closed OTU generation against the GreenGenes 97% OTU reference
pick_closed_reference_otus.py \
    -i projects/tutorial/working/qiime/post.fna \
    -o projects/tutorial/working/qiime/gg97closed \
    -r references/gg_13_8_otus/rep_set/97_otus.fasta \
    -t references/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt \
    -s -f
v) Generate a DECARD-style OTU table from the BIOM format output
QIIME_generate_otu_table.py \
    projects/tutorial/working/qiime/gg97closed/otu_table.biom \
    projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
    -s projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
    -e firstname.lastname@example.org
Please replace firstname.lastname@example.org with your email address.
vi) Generate a phylogenetic tree
biom convert -i projects/tutorial/working/qiime/gg97closed/otu_table.biom \
    -o projects/tutorial/working/qiime/gg97closed/otu_table.txt \
    --to-tsv --header-key taxonomy
filter_tree.py -i references/gg_13_8_otus/trees/97_otus.tree \
    -t projects/tutorial/working/qiime/gg97closed/otu_table.txt \
    -o projects/tutorial/working/qiime/gg97closed/pruned.tre
13. Test OTU
We can use our classifier output (converted into our OTU table format) to test OTU generation fidelity.
To do so, for each pair of sequences we determine whether they came from the same organism and, if so, whether the classifier correctly paired them. Likewise, for sequences from different source organisms, we ask whether the classifier appropriately split them.
mkdir projects/tutorial/analysis/otu
test_otu.py projects/tutorial/inputs/tut01.emp5k.map.csv \
    projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
    projects/tutorial/analysis/otu/qiime_gg97closed.otu_test.csv
The output is a CSV file:
Community,Correct_Match,Correct_Split,Incorrect_Match,Incorrect_Split,Dropped,Total,sensitivity,specificity,ppv,npv,accuracy
Tut01CM0,1033292,11409274,0,0,1,4990,1.0,1.0,1.0,1.0,1.0
Tut01CM1,1144261,11233407,84860,0,0,4993,1.0,0.9925023857450969,0.930958790875756,1.0,0.9931907876154822
Tut01CM2,776520,11611139,59896,0,0,4990,1.0,0.9948679787182542,0.9283897008187314,1.0,0.995188131323782
...
(Your output will differ from this, as the communities are randomly generated.)
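The summary statistics follow from the pair counts in the standard confusion-matrix way: a correctly matched pair is a true positive and a correctly split pair a true negative. A sketch, checked against the Tut01CM1 row of the example output above:

```python
def pair_metrics(correct_match, correct_split, incorrect_match, incorrect_split):
    """Confusion-matrix statistics over sequence pairs: a correct match is a
    true positive, a correct split a true negative."""
    tp, tn, fp, fn = correct_match, correct_split, incorrect_match, incorrect_split
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Pair counts from the Tut01CM1 row of the example output.
m = pair_metrics(1144261, 11233407, 84860, 0)
print(m)
```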
14. Test Classification
Next up, testing classification!
mkdir projects/tutorial/analysis/classification
test_classification.py \
    projects/tutorial/inputs/tut01.emp5k.map.csv \
    projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
    projects/tutorial/analysis/classification/qiime_gg97closed \
    -s \
    -e firstname.lastname@example.org
Output is in a few files. A JSON file is created with enough detail for interactive visualization. CSV files are provided for species, genus, and no targeted rank of classification. Each CSV file gives the number of source organisms correctly classified, undercalled (in the correct clade, but not to the target resolution), overcalled (not relevant in this revision), miscalled sibling (correct parent, but wrong final rank), miscalled (down the wrong clade), and dropped.
community,correct,undercalled,overcalled,miscalled_sibs,miscalled,dropped
Tut01CM0,15.998914223669924,2.0,0.0,0.0,2.0,0.0010857763300760044
Tut01CM1,16.0,0.0,0.0,0.0,2.0,0.0
Tut01CM2,21.0,6.0,0.0,0.0,0.0,0.0
Tut01CM3,20.0,2.0,0.0,0.0,2.0,1.0
Tut01CM4,20.0,4.0,0.0,0.0,1.0,0.0
Tut01CM5,22.0,1.0,0.0,0.0,1.0,0.0
Tut01CM6,15.0,1.0,0.0,0.0,1.0,0.0
Tut01CM7,15.0,3.0,0.0,0.0,3.0,0.0
Tut01CM8,12.0,3.0,0.0,0.0,0.0,0.0
Tut01CM9,16.0,5.0,0.0,0.0,3.0,0.0
15. Diversity Fidelity
We can now compare how well our classifier output matches the true diversity of the community, and create a spiffy figure along the way too!
mkdir projects/tutorial/analysis/diversity/
otutable_diversity_comparison.py projects/tutorial/targets.csv \
    --rRNA16SSU references/byGenus/ \
    --seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
    --statsfile projects/tutorial/analysis/diversity/diversity.xlsx \
    --otutables projects/tutorial/outputs/qiime_gg97closed.otutable.csv
diversity_figure.py \
    projects/tutorial/analysis/diversity/diversity.xlsx \
    projects/tutorial/analysis/diversity/diversity.pdf
otutable_diversity_comparison.py takes multiple OTU table outputs (to compare different pipelines or settings), calculates Shannon Diversity for each community from the output, and calculates a Shannon Diversity for the true population. The output is an Excel file.
diversity_figure.py takes this Excel file and plots the true vs estimated Shannon alpha-diversity for each pipeline as a 2-D plot. It additionally calculates Spearman R^2 to test for monotonicity.
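Shannon diversity itself is the standard formula H = -sum(p_i * ln(p_i)) over the relative abundances; a minimal sketch (not DECARD's code):

```python
import math

def shannon_diversity(abundances):
    """Shannon diversity H = -sum(p * ln p) over nonzero relative abundances."""
    total = sum(abundances)
    probs = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in probs)

# Four equally abundant organisms give the maximum H for four taxa, ln(4);
# a single-organism community has zero diversity.
print(shannon_diversity([25, 25, 25, 25]))
print(shannon_diversity([100]))
```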
16. Pairwise-distance (beta-diversity) testing
First we need to create a 'true' phylogenetic tree from the full-length 16S sequences used to build our communities.
mkdir projects/tutorial/tree
target_to_phylogenetic_tree.py projects/tutorial/targets.csv \
    --output_dir projects/tutorial/tree/ \
    --email email@example.com \
    --cm /usr/local/lib/python2.7/dist-packages/deenurp-0.1.8.dev59-py2.7.egg/deenurp/data/RRNA_16S_BACTERIA.cm \
    --threads 1
cp projects/tutorial/tree/RAxML_bestTree.target.tre projects/tutorial/targets.tre
Now we can proceed with comparing the true pairwise distances between the communities we generated, and the distances estimated by the classifiers.
To help with this, let's create a configuration file
echo -e "name,otutable,tree\nQIIME gg97 closed,projects/tutorial/outputs/qiime_gg97closed.otutable.csv,projects/tutorial/working/qiime/gg97closed/pruned.tre" > projects/tutorial/analysis/distance_config.csv
otutable_distance_comparison.py \
    projects/tutorial/targets.csv projects/tutorial/targets.tre \
    --rRNA16SSU references/byGenus/ \
    --seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
    --figure projects/tutorial/analysis/distance.pdf \
    --distfile projects/tutorial/analysis/distance.xlsx \
    --dpcoa \
    --config projects/tutorial/analysis/distance_config.csv
Two outputs are produced: an Excel file (with many subsheets) and a PDF in which each distance method gets a page showing a 2D plot of true pairwise distance versus the pairwise distance estimated by the classifier under that method.