Welcome to DECARD (Detailed Evaluation Creation and Analysis of Read Data).
What is DECARD?

DECARD can be used to simulate amplicon-based microbiome experiments, and test how classification software performs. DECARD can help with each step of the process:

Downloading reference sequences from NCBI (NCBI_16SMicrobial_update.py) and SILVA (SILVA_selectFromGenera.py)
Curating the reference sequences by genus and species (selectFromGenera.py)
Performing quality control on the references (genera_qc.py)
Generating targets for a set of communities (generate_targets.py)
Performing in-silico PCR to generate amplicon sequences (generate_sequences.py)
(Optional) Performing in-silico sequencing of the amplicons, with error addition, using a tool such as ART

These amplicons or sequences can then be classified. The resulting classifications can be evaluated:

1. For the accuracy of the OTU-generation step (test_otu.py)
2. For the resolution and accuracy of the classification of each source organism (test_classification.py)

Version 0.3

Tutorial:
1. Install docker on your computer.
https://docs.docker.com/engine/installation/

2. Grab the decard container:

docker pull golob/decard

3. Run the container.

docker run -it golob/decard

4. Change to our home directory.

cd ~

5. Make a directory structure to support our project.

mkdir -p projects/tutorial/inputs
mkdir -p projects/tutorial/outputs
mkdir -p projects/tutorial/analysis
mkdir -p projects/tutorial/working

(Note: these directories will not be saved when you quit the container. Docker also lets you mount a real directory from your computer's file system at the container's /root/projects directory, to save your results.)

6. Update the NCBI 16S Microbial Reference (optional)

NCBI_16SMicrobial_update.py references/NCBI_16S_microbial.fasta -e <your@email>

Replace your@email with a valid email address (this is per a request from NCBI).
The container contains a cached copy of the NCBI 16S Microbial database as of February 2017.
Updates are incremental: only the new references are downloaded, so this typically takes only a moment.

7. Define the Community Composition

We need to define the community we wish to create. For each genus we wish to include, we specify the mean and standard deviation (SD) of its fractional abundance, along with parameters that determine how many specific organisms to select from that genus.

To determine how many organisms to include from a given genus, the CSV file can specify either a log model,

a * log(f) + b

(where a and b are parameters and f is the fractional abundance of the genus), or a simpler mean and standard deviation of the number of species to include. DECARD enforces a minimum of one organism per genus represented in the community.

(Why are there two ways of defining how many organisms to include per genus? With real data it can be hard to fit the log model, or the model may not fit particularly well. The mean-and-standard-deviation method is a back-up for those cases.)
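As an illustration, here is a sketch of the selection logic described above. This is a hypothetical helper, not part of DECARD's code, and it assumes the natural log:

```python
import math
import random

def organisms_per_genus(f, slope=None, intercept=None,
                        mean_n=None, std_n=None, rng=random):
    """Pick how many organisms to include for a genus with fractional
    abundance f: use the log model if its parameters are given,
    otherwise fall back to the mean/SD method."""
    if slope is not None and intercept is not None:
        n = slope * math.log(f) + intercept
    else:
        n = rng.gauss(mean_n, std_n)
    # DECARD enforces a minimum of one organism per genus.
    return max(1, int(round(n)))

# Log model with the Bacteroides row from the example file below:
# 0.87 * log(0.354) + 10.61
print(organisms_per_genus(0.354, slope=0.87, intercept=10.61))  # → 10
```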

We define the target community composition with a CSV file containing the following columns:

Genus: Capitalized plain-text genus name (e.g. Bacteroides)
TaxID: NCBI taxonomy ID for this genus (e.g. 816)
Fraction: Fractional (or weighting) abundance, 0.0 - 1.0 (e.g. 0.35). DECARD normalizes the sum of these over a community to 1.0, so you don't have to do so manually.
STD: Standard deviation of the fractional abundance of this genus (e.g. 0.05)
Species_slope: The slope for the model Species_slope * log(f) + Species_intercept
Species_intercept: The intercept for the same model
Species_p: The p-value of the fit for the log model
Species_n: Mean number of organisms to include for this genus
Species_std: Standard deviation of the number of organisms for this genus

The example for the demo:
Genus,TaxID,Fraction,STD,Species_n,Species_std,Species_slope,Species_intercept,Species_p
Bacteroides,816,0.354,0.234,9.4,2.44,0.87,10.61,0.01
Faecalibacterium,216851,0.079,0.08,1,0,0,1,1
Alistipes,239759,0.077,0.096,1.5,0.55,0.21,2.28,0
Blautia,572511,0.042,0.094,2.8,1.35,0.69,5.95,0
Parabacteroides,375288,0.036,0.035,1.8,0.65,0.15,2.32,0.15
Eubacterium,1730,0.035,0.042,1.5,0.6,0.18,2.27,0
Prevotella,838,0.031,0.087,1.4,0.98,0.19,2.04,0.2
Escherichia,561,0.028,0.096,2,1.24,0.16,2.87,0.25
Streptococcus,1301,0.026,0.112,2.2,1.28,0.47,5.08,0
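To illustrate the normalization DECARD performs on the Fraction column, here is a sketch that reads a trimmed-down copy of the example file (only the first three rows and four columns) and rescales the fractions to sum to 1.0:

```python
import csv
import io

# A trimmed copy of the example Community.csv above.
community_csv = """\
Genus,TaxID,Fraction,STD
Bacteroides,816,0.354,0.234
Faecalibacterium,216851,0.079,0.08
Alistipes,239759,0.077,0.096
"""

rows = list(csv.DictReader(io.StringIO(community_csv)))
total = sum(float(r["Fraction"]) for r in rows)
for r in rows:
    # Normalize so the community's fractions sum to 1.0.
    r["Fraction"] = float(r["Fraction"]) / total

print(round(sum(r["Fraction"] for r in rows), 6))  # → 1.0
```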

This file is available in the decard_data directory. You can put it in the proper place with the command:

cp decard_data/Community.csv projects/tutorial/

8. Recruit sequences for each genus

First, let's make a directory to store our recruited sequences:

mkdir references/byGenus

Now we can recruit sequences from our NCBI reference for each genus we're adding to our community:

selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Bacteroides
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Faecalibacterium
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Alistipes
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Blautia
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Parabacteroides
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Eubacterium
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Prevotella
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Escherichia
selectFromGenera.py -d references/byGenus/ -r references/NCBI_16S_microbial.fasta -g Streptococcus

9. Quality Control Sequences

Again, this needs to be done for each genus in our community. This step ensures there are no ambiguous bases and that sequences meet a minimum length, and discards sequences that are inconsistent (same name, but differing from the majority).

This can be a slow and computationally intensive process.
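The first two checks described above can be sketched as a toy filter. The minimum length of 1200 bp here is an assumed example value, not genera_qc.py's actual threshold, and the consistency check is omitted:

```python
def passes_qc(seq, min_length=1200):
    """Toy version of two of the QC checks described above: reject
    sequences with ambiguous (non-ACGT) bases or below a minimum
    length. genera_qc.py's real checks are richer than this."""
    seq = seq.upper()
    return len(seq) >= min_length and set(seq) <= set("ACGT")

print(passes_qc("ACGT" * 400))   # → True  (1600 bp, unambiguous)
print(passes_qc("ACGTN" * 400))  # → False (contains an N)
print(passes_qc("ACGT" * 100))   # → False (only 400 bp)
```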

genera_qc.py references/byGenus/Bacteroides/
genera_qc.py references/byGenus/Faecalibacterium/
genera_qc.py references/byGenus/Alistipes/
genera_qc.py references/byGenus/Blautia/
genera_qc.py references/byGenus/Parabacteroides/
genera_qc.py references/byGenus/Eubacterium/
genera_qc.py references/byGenus/Prevotella/
genera_qc.py references/byGenus/Escherichia/
genera_qc.py references/byGenus/Streptococcus/

10. Generate the synthetic Communities
We will now use our model and the QC'ed 16S sequences to make some communities. This happens in two steps. First we generate a target file that lists specific 16S SSU accessions and their weights in the final community. We then use this file to perform in-silico PCR and actually create read files to classify. Splitting these steps lets us try different PCR primers on the same synthetic community.

10a. Generate Targets
Here we will create 10 communities based on the community description we created above.

generate_targets.py \
-g references/byGenus/ \
-d projects/tutorial/Community.csv \
-n 10 \
-p Tut01 -o projects/tutorial/targets.csv

10b. Generate Sequences

Now we can perform in-silico PCR and generate read files. For this we need PCR primer sequences; in this example we will use those of the Earth Microbiome Project (EMP). We also need to specify a goal number of reads per community.

generate_sequences.py \
--targets projects/tutorial/targets.csv \
--rRNA16SSU references/byGenus/  \
--primers decard_data/primers/Illumina_EMP_U515f_806R.fasta \
--count 5000 \
--map projects/tutorial/inputs/tut01.emp5k.map.csv \
--output projects/tutorial/inputs/tut01.emp5k.fasta

This will take a moment to run.

The map file contains a generated sequence ID, community ID, organism name, NCBI taxon ID, and source accession. (Because sequences are picked randomly, yours may not exactly match mine.) This file is the true mapping of amplicon to source organism, used later to assess how well a given classifier performed.

seqID,community,organism,ncbi_tax_id,sourceSeq
Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
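As a sketch of how this map can be consumed downstream, the following tallies reads per (community, organism) pair from the two example rows above:

```python
import csv
import io
from collections import Counter

# The two example map rows shown above.
map_csv = """\
seqID,community,organism,ncbi_tax_id,sourceSeq
Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59,Tut01CM0,Bacteroides caecigallinarum,1411144,NR_145844.1
"""

# Count how many amplicons each source organism contributed per community.
counts = Counter()
for row in csv.DictReader(io.StringIO(map_csv)):
    counts[(row["community"], row["organism"])] += 1

print(counts[("Tut01CM0", "Bacteroides caecigallinarum")])  # → 2
```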

The FASTA file will have the amplicons themselves.

>Tut01CM0SCRa2fd0f96c46a4a69a65c6b2ba079ec32 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
GTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAA
GGGAGCGTAGACGGGATGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCTTAAAATTGC
AGTTGAAACTGGCATTCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGA
AATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGAGCGTAACTGAC
GTTGATGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCC
>Tut01CM0SCR9940121f2321496b8307eda7aa7d0a59 From NR_145844.1 description="Bacteroides caecigallinarum strain C13EG111 16S ribosomal RNA, partial sequence." organism="Bacteroides caecigallinarum" taxonomy="Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides" ncbi_tax_id="1411144"
GTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAA
GGGAGCGTAGACGGGATGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCTTAAAATTGC
AGTTGAAACTGGCATTCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGA

11. Generate a Sequence to community info file

map_to_seq_info.py \
projects/tutorial/inputs/tut01.emp5k.map.csv \
projects/tutorial/inputs/tut01.emp5k.seq_info.csv

12. Classify using these amplicon sequences
This step depends on the classifier and the settings used. One requirement: the results must be completely per-sequence, with each and every amplicon sequence assigned an OTU and classification.

We've written modules that can read in the output from QIIME and MOTHUR. To help with this tutorial, we've installed QIIME and the greengenes reference set for you.

Here's a brief set of steps to classify our reads against the GreenGenes 97% OTU reference with closed-reference OTU picking:

i) Make a fake mapping file:
This is to help with getting classification output for each sequence.

mkdir projects/tutorial/working/qiime
echo -e "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\ntut01\t\t\twithouterror" \
> projects/tutorial/working/qiime/map.txt

ii) Preprocess

split_libraries.py \
-m projects/tutorial/working/qiime/map.txt \
-f projects/tutorial/inputs/tut01.emp5k.fasta \
-o projects/tutorial/working/qiime -a 0 -b 0 -p

iii) Allow for per-sequence output (DECARD helper module)

QIIME_preprocess_bySeq.py \
projects/tutorial/working/qiime/seqs.fna \
projects/tutorial/working/qiime/post.fna

iv) Classify with closed OTU generation against the GreenGenes 97% OTU reference

pick_closed_reference_otus.py \
-i projects/tutorial/working/qiime/post.fna \
-o projects/tutorial/working/qiime/gg97closed \
-r references/gg_13_8_otus/rep_set/97_otus.fasta \
-t references/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt \
-s -f

v) Generate a DECARD-style OTU table from the BIOM format output

QIIME_generate_otu_table.py \
projects/tutorial/working/qiime/gg97closed/otu_table.biom \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
-s projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
-e your@email.com

Please replace your@email.com with your email address.

vi) Generate a phylogenetic tree

biom convert -i projects/tutorial/working/qiime/gg97closed/otu_table.biom \
-o projects/tutorial/working/qiime/gg97closed/otu_table.txt \
--to-tsv --header-key taxonomy

filter_tree.py -i references/gg_13_8_otus/trees/97_otus.tree \
-t projects/tutorial/working/qiime/gg97closed/otu_table.txt \
-o projects/tutorial/working/qiime/gg97closed/pruned.tre

13. Test OTU
We can use the classifier output (converted into our OTU-table format) to test the fidelity of OTU generation.
For each pair of sequences from the same source organism, we ask whether the classifier correctly grouped them together; for each pair from different source organisms, we ask whether it appropriately split them.
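The pairwise bookkeeping can be sketched as follows. This is a simplified illustration of the idea, not test_otu.py's actual implementation, and the Dropped column is not reproduced:

```python
from itertools import combinations

def score_otus(true_org, otu):
    """Pairwise OTU scoring sketch: true_org and otu map each sequence
    ID to its source organism and its assigned OTU, respectively."""
    tally = dict(Correct_Match=0, Correct_Split=0,
                 Incorrect_Match=0, Incorrect_Split=0)
    for a, b in combinations(true_org, 2):
        same_org = true_org[a] == true_org[b]
        same_otu = otu[a] == otu[b]
        if same_org and same_otu:
            tally["Correct_Match"] += 1      # same organism, kept together
        elif not same_org and not same_otu:
            tally["Correct_Split"] += 1      # different organisms, split
        elif not same_org and same_otu:
            tally["Incorrect_Match"] += 1    # different organisms, lumped
        else:
            tally["Incorrect_Split"] += 1    # same organism, split apart
    return tally

# Three sequences, all lumped into one OTU by the classifier:
true_org = {"s1": "A", "s2": "A", "s3": "B"}
otu      = {"s1": "otu1", "s2": "otu1", "s3": "otu1"}
print(score_otus(true_org, otu))
```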

mkdir projects/tutorial/analysis/otu

test_otu.py projects/tutorial/inputs/tut01.emp5k.map.csv \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
projects/tutorial/analysis/otu/qiime_gg97closed.otu_test.csv

The output is a CSV file:

Community,Correct_Match,Correct_Split,Incorrect_Match,Incorrect_Split,Dropped,Total,sensitivity,specificity,ppv,npv,accuracy
Tut01CM0,1033292,11409274,0,0,1,4990,1.0,1.0,1.0,1.0,1.0
Tut01CM1,1144261,11233407,84860,0,0,4993,1.0,0.9925023857450969,0.930958790875756,1.0,0.9931907876154822
Tut01CM2,776520,11611139,59896,0,0,4990,1.0,0.9948679787182542,0.9283897008187314,1.0,0.995188131323782
....

(Your output will differ from this, since the communities are randomly generated.)

14. Test Classification
Next up, testing classification!

mkdir projects/tutorial/analysis/classification

test_classification.py \
projects/tutorial/inputs/tut01.emp5k.map.csv \
projects/tutorial/outputs/qiime_gg97closed.otutable.csv \
projects/tutorial/analysis/classification/qiime_gg97closed \
-s \
-e your@email.com

Output comes in a few files. A JSON file is created with enough detail for interactive visualization. CSV files are provided for species, genus, and no targeted rank of classification. Each CSV file gives, per community, the number of source organisms correctly classified, undercalled (in the correct clade, but not at the target resolution), overcalled (not relevant in this revision), miscalled as a sibling (correct parent, but wrong final rank), miscalled (down the wrong clade), and dropped.

community,correct,undercalled,overcalled,miscalled_sibs,miscalled,dropped
Tut01CM0,15.998914223669924,2.0,0.0,0.0,2.0,0.0010857763300760044
Tut01CM1,16.0,0.0,0.0,0.0,2.0,0.0
Tut01CM2,21.0,6.0,0.0,0.0,0.0,0.0
Tut01CM3,20.0,2.0,0.0,0.0,2.0,1.0
Tut01CM4,20.0,4.0,0.0,0.0,1.0,0.0
Tut01CM5,22.0,1.0,0.0,0.0,1.0,0.0
Tut01CM6,15.0,1.0,0.0,0.0,1.0,0.0
Tut01CM7,15.0,3.0,0.0,0.0,3.0,0.0
Tut01CM8,12.0,3.0,0.0,0.0,0.0,0.0
Tut01CM9,16.0,5.0,0.0,0.0,3.0,0.0
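A toy version of the bucketing described above, treating each classification as a root-first lineage list. This illustrates the categories only; it is not test_classification.py's actual code:

```python
def classify_call(truth, call):
    """Bucket one classification against the true lineage. Both are
    root-first lists, e.g. ["Bacteroidales", "Bacteroidaceae",
    "Bacteroides"]. A sketch of the categories, not DECARD's logic."""
    if not call:
        return "dropped"
    if call == truth:
        return "correct"
    if call == truth[:len(call)]:
        return "undercalled"        # right clade, short of target rank
    if len(call) == len(truth) and call[:-1] == truth[:-1]:
        return "miscalled_sibs"     # correct parent, wrong final rank
    return "miscalled"              # down the wrong clade

truth = ["Bacteroidales", "Bacteroidaceae", "Bacteroides"]
print(classify_call(truth, truth))                                # → correct
print(classify_call(truth, ["Bacteroidales", "Bacteroidaceae"]))  # → undercalled
print(classify_call(truth, ["Bacteroidales", "Bacteroidaceae", "Prevotella"]))  # → miscalled_sibs
```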

15. Diversity Fidelity
We can now compare how well our classifier output matches the true diversity of the community, and create a spiffy figure along the way too!

mkdir projects/tutorial/analysis/diversity/

otutable_diversity_comparison.py projects/tutorial/targets.csv \
--rRNA16SSU references/byGenus/ \
--seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
--statsfile projects/tutorial/analysis/diversity/diversity.xlsx \
--otutables projects/tutorial/outputs/qiime_gg97closed.otutable.csv

diversity_figure.py \
projects/tutorial/analysis/diversity/diversity.xlsx \
projects/tutorial/analysis/diversity/diversity.pdf

otutable_diversity_comparison.py takes multiple OTU-table outputs (to compare different pipelines or settings), calculates the Shannon diversity of each community from each output, and calculates the Shannon diversity of the true population. The output is an Excel file.

diversity_figure.py takes this Excel file and plots the true vs estimated Shannon alpha-diversity for each pipeline as a 2-D plot. It additionally calculates Spearman R^2 to test for monotonicity.
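For reference, the Shannon diversity of a community with taxon proportions p_i is H = -sum(p_i * ln p_i). A minimal sketch over raw counts:

```python
import math

def shannon(counts):
    """Shannon alpha-diversity H = -sum(p_i * ln p_i) over taxon counts,
    skipping zero-count taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c)

# An even community is more diverse than a skewed one of the same richness.
print(round(shannon([25, 25, 25, 25]), 4))  # → 1.3863 (= ln 4)
print(round(shannon([97, 1, 1, 1]), 4))
```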

16. Pairwise-distance (beta-diversity) testing

First, we need to create a 'true' phylogenetic tree from the full-length 16S sequences used to build our communities:

mkdir projects/tutorial/tree

target_to_phylogenetic_tree.py projects/tutorial/targets.csv \
--output_dir projects/tutorial/tree/ \
--email your@email.com \
--cm /usr/local/lib/python2.7/dist-packages/deenurp-0.1.8.dev59-py2.7.egg/deenurp/data/RRNA_16S_BACTERIA.cm \
--threads 1

cp projects/tutorial/tree/RAxML_bestTree.target.tre projects/tutorial/targets.tre

Now we can compare the true pairwise distances between the generated communities with the distances estimated by the classifiers.

To help with this, let's create a configuration file:

echo -e "name,otutable,tree\nQIIME gg97 closed,projects/tutorial/outputs/qiime_gg97closed.otutable.csv,projects/tutorial/working/qiime/gg97closed/pruned.tre" > projects/tutorial/analysis/distance_config.csv

otutable_distance_comparison.py \
projects/tutorial/targets.csv projects/tutorial/targets.tre \
--rRNA16SSU references/byGenus/  \
--seqinfo projects/tutorial/inputs/tut01.emp5k.seq_info.csv \
--figure projects/tutorial/analysis/distance.pdf \
--distfile projects/tutorial/analysis/distance.xlsx \
--dpcoa \
--config projects/tutorial/analysis/distance_config.csv

Two outputs are produced: an Excel file (with many subsheets) and a PDF in which each distance method gets its own page, showing a 2-D plot of true pairwise distance versus the pairwise distance estimated by the classifier under that method.
