Dockerfile to run the coral pilot nextflow pipeline

Coral Reefs Metagenomics -- Pilot Study

This study is aimed at assessing our sampling strategy and methods as well as extrapolating the amount of sequencing needed for a future main experiment.

The future main experiment in question consists of sequencing water and soil samples from 6 different sampling sites in Kenya and Mauritius.

In this pilot study we collected 4 water samples: 2 close to the shore and 2 close to coral reefs in the marine sanctuary of Kuruwitu, Kenya.

DNA was extracted from these water samples using MoBio water kits. Sequencing libraries were prepared using TruSeq, and the samples were sequenced in 2 MiSeq runs, for approximately 7Gb of data per sample.

This document describes the data analysis.

Raw Data

The raw data were deposited in a private repository at the European Nucleotide
Archive (ENA)

  • Project: PRJEB20178

To start reproducing the results of this pilot study, create a folder called
00_data and download the reads there.

Quality Control

Reports containing information about the sequence quality were provided by SciLifeLab.

As usual, we notice a few adapter sequences left, as well as declining quality
scores towards the end of the sequences.

We use scythe to trim the 3'-end adapters and sickle to trim bases of poor
quality.

Adapter trimming

The reports tell us that the tags are GTTTCG, CGTACG, GAGTGG and ACTGAT.

File used for removing the adapters: adapters.fasta

After adapter removal, the reads were quality-trimmed with a base quality
threshold of 20 and a minimum length of 50bp.
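The two trimming steps can be sketched as a small shell function. This is a minimal sketch and not the code of the nextflow pipeline: the function name trim_pair, the input and intermediate file names and the quality encoding (-t sanger) are assumptions, while adapters.fasta, the quality threshold of 20 and the minimum length of 50bp come from the text above.

```shell
# Hypothetical helper: trim one read pair with scythe, then sickle.
# Usage: trim_pair <prefix>   (expects <prefix>_R1.fastq and <prefix>_R2.fastq)
trim_pair() {
    prefix=$1
    # 1. Remove 3'-end adapter contamination from each mate with scythe
    for mate in R1 R2; do
        scythe -a adapters.fasta \
            -o "${prefix}_${mate}_noadapt.fastq" \
            "${prefix}_${mate}.fastq" || return 1
    done
    # 2. Quality-trim the pair with sickle: quality threshold 20, minimum length 50bp
    #    (assumed: sanger quality encoding; reads losing their mate go to the singles file)
    sickle pe -t sanger -q 20 -l 50 \
        -f "${prefix}_R1_noadapt.fastq" -r "${prefix}_R2_noadapt.fastq" \
        -o "${prefix}_R1_trimmed.fastq" -p "${prefix}_R2_trimmed.fastq" \
        -s "${prefix}_singles.fastq"
}

# Example call (assumed sample prefix):
# trim_pair 00_data/ERS1646376
```

The *_R1_trimmed.fastq and *_R2_trimmed.fastq names match the files used in the taxonomic classification step further down.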

To run both the adapter / quality trimming and the computing part of nonpareil,
run the nextflow pipeline:

docker pull hadrieng/slu_corals_pilot
nextflow run

Nonpareil curves

Nonpareil uses the redundancy of the reads in metagenomic datasets to
estimate the average coverage and predict the amount of sequences that
will be required to achieve “nearly complete coverage”.

In short, nonpareil helps us figure out how much we need to sequence in order
to recover (almost) every species from our metagenomic samples.

After adapter removal and quality trimming, technical replicates and read pairs were pooled together.
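Pooling and running nonpareil could look like the sketch below. It assumes Nonpareil 3 in k-mer mode (-T kmer), which reads FASTQ directly; the function name pooled_nonpareil and the output naming are hypothetical, and the options actually used in the pipeline are not stated in this document.

```shell
# Hypothetical helper: pool trimmed reads for one sample and run Nonpareil.
# Usage: pooled_nonpareil <sample> <fastq> [<fastq> ...]
pooled_nonpareil() {
    sample=$1
    shift
    # Pool technical replicates and both mates into a single FASTQ file
    cat "$@" > "${sample}_pooled.fastq" || return 1
    # Estimate coverage from read redundancy; -b sets the output file prefix
    nonpareil -s "${sample}_pooled.fastq" -T kmer -f fastq -b "$sample"
}

# Example call (assumed file names):
# pooled_nonpareil coral_filt ../01_qc/ERS1646376_R1_trimmed.fastq ../01_qc/ERS1646376_R2_trimmed.fastq
```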

According to the author of the tool:

The problem with paired reads is that they are not independent events. If
your average insert size is longer than twice the read length, the coverage
will likely be underestimated. On the other hand, if you have small average
insert sizes (close to a read length), the coverage will likely be
overestimated. However, if you have a small insert size and you can combine
most pairs (e.g., using PEAR), the resulting merged reads are probably better
estimators. In all other cases, I recommend simply using one of the sister

So, without the reads pooled together:

Sample            Coverage (%)   80% (Gb)   85% (Gb)   90% (Gb)   95% (Gb)
baseline filt     0.37           30.78      48.34      85.99      205.21
baseline unfilt   0.32           34.27      51.30      85.75      186.01
coral filt        0.30           43.31      66.46      114.69     261.25
coral unfilt      0.36           41.49      68.48      129.89     342.12

45Gb on the corals (which is 1 HiSeq 2500 lane) should give us ~ 80% of our
total population.
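To put the table in perspective, the sequencing effort can be converted into HiSeq 2500 lanes. The sketch below only uses the 45Gb per lane figure quoted above and the values from the table; the function name gb_to_lanes is hypothetical.

```shell
# Convert a sequencing effort in Gb into HiSeq 2500 lanes,
# using the 45Gb per lane figure quoted in the text.
gb_to_lanes() {
    LC_ALL=C awk -v gb="$1" -v lane=45 'BEGIN { printf "%.2f\n", gb / lane }'
}

# Filtered coral samples, values taken from the table
gb_to_lanes 43.31     # effort for 80% coverage, prints 0.96
gb_to_lanes 261.25    # effort for 95% coverage, prints 5.81
```

For the filtered coral samples this means about one lane for 80% coverage, but almost six lanes for 95%.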

Taxonomic Classification

We use kaiju for taxonomic classification. This step has not been included in the nextflow pipeline since it is too computationally intensive for a consumer-grade laptop (~50G RAM required).

# classify each trimmed read pair with kaiju (24 threads)
for i in ../01_qc/*_R1_trimmed.fastq
do
        prefix=$(basename "$i" _R1_trimmed.fastq)
        kaiju -t "$KAIJUDB/nodes.dmp" -f "$KAIJUDB/kaiju_db_nr.fmi" \
            -i "../test/${prefix}_R1_trimmed.fastq" \
            -j "../test/${prefix}_R2_trimmed.fastq" \
            -o "${prefix}.txt" -z 24
done
# summarise each kaiju output at the species level
for report in *.txt
do
        kaijuReport -t "$KAIJUDB/nodes.dmp" -n "$KAIJUDB/names.dmp" \
            -i "$report" -r species \
            -l superkingdom,phylum,order,class,family,genus,species \
            -o "$(basename "$report" .txt)_summary.txt"
done

To visualize the kaiju results, we'll use pavian. Pavian only takes kraken or centrifuge results as input, so we'll need to convert the kaiju results to be compatible with pavian.

cd kaiju_2_kraken
for sum in ../03_taxonomy/ERS16463*_summary.txt
    do ./ \
        "$sum" > "../03_taxonomy/$(basename "$sum" _summary.txt)_pavian.txt"
done

Now launch pavian with Docker:

docker pull 'florianbw/pavian'
docker run --rm -p 5000:80 florianbw/pavian

Then open port 5000 of the Docker host (the port mapped by -p 5000:80 above) in your browser to access pavian.

Comparison with previous studies

In all samples, the 3 dominant families are Flavobacteriaceae, Rhodobacteraceae and Pelagibacteraceae.

In all samples, by far the dominant genus was Candidatus Pelagibacter.

Most studies investigating coral health have been done with 16S amplification, and conducted with mucus, skeleton and tissue samples.

It has been shown that the microbiome of the corals (whether it is mucus, skeleton or else) is very different from the microbiome of seawater.

Metagenome Assembly

/opt/sge/scripts/ -e -c 24 -m 5 -t 96 -1 02_trimming/ERS1646376_R1_trimmed.fastq,02_trimming/ERS1646380_R1_trimmed.fastq -2 02_trimming/ERS1646376_R2_trimmed.fastq,02_trimming/ERS1646380_R2_trimmed.fastq -o 04_megahit/coral_unfiltered

bioawk -c fastx '{ if(length($seq) > 1500) { print ">"$name; print $seq }}' coral_filtered/final.contigs.fa > coral_filtered/contigs_1500bp.fasta

stats for coral_filtered

sum = 1179639404, n = 3437909, ave = 343.13, largest = 23553
N50 = 301, n = 1476267
N60 = 301, n = 1868174
N70 = 301, n = 2260081
N80 = 301, n = 2651988
N90 = 301, n = 3043895
N100 = 200, n = 3437909
N_count = 0
Gaps = 0

stats for coral_filtered/contigs_1500bp.fasta
sum = 12636630, n = 5793, ave = 2181.36, largest = 23553
N50 = 2052, n = 2111
N60 = 1891, n = 2754
N70 = 1763, n = 3447
N80 = 1656, n = 4187
N90 = 1573, n = 4971
N100 = 1501, n = 5793
N_count = 0
Gaps = 0
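
The N50 values above can be recomputed from a FASTA file with standard tools. The sketch below assumes the common definition: going from the longest contig to the shortest, N50 is the length of the contig at which the cumulative length first reaches half of the total assembly length. The tool that produced the stats above is not named here, so its exact convention may differ slightly; the function name n50 is hypothetical.

```shell
# Hypothetical helper: print the N50 of a FASTA file and the number of
# contigs needed to reach it.
# Usage: n50 <contigs.fasta>
n50() {
    # Emit one length per sequence (multi-line sequences are summed)
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Longest contigs first
    sort -rn |
    # Walk down the list until half of the total length is covered
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 cum += lens[i]
                 if (cum >= total / 2) {
                     printf "N50 = %d, n = %d\n", lens[i], i
                     exit
                 }
             }
         }'
}

# Example call:
# n50 coral_filtered/contigs_1500bp.fasta
```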


Map the reads to the assembly using bowtie2:

/opt/sge/scripts/ -e -c 8 -m 20 -t 48 -r 04_megahit/coral_filtered/final.contigs.fa -1 05_binning/concatenated_reads/coral_filt_R1.fastq -2 05_binning/concatenated_reads/coral_filt_R2.fastq -o 05_binning/coral_filtered/
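
Outside the cluster, the same mapping can be sketched with plain bowtie2 calls. This is a minimal sketch, not the content of the wrapper script used above: the function name map_reads, the index prefix and the SAM file name are assumptions.

```shell
# Hypothetical helper: index an assembly and map one read pair to it.
# Usage: map_reads <contigs.fa> <R1.fastq> <R2.fastq> <outdir> [threads]
map_reads() {
    contigs=$1; r1=$2; r2=$3; outdir=$4; threads=${5:-8}
    mkdir -p "$outdir" || return 1
    # Build the bowtie2 index from the assembled contigs
    bowtie2-build "$contigs" "$outdir/contigs" || return 1
    # Align the paired reads and write the alignments as SAM
    bowtie2 -p "$threads" -x "$outdir/contigs" \
        -1 "$r1" -2 "$r2" -S "$outdir/mapped.sam"
}

# Example call with the files from the command above:
# map_reads 04_megahit/coral_filtered/final.contigs.fa \
#     05_binning/concatenated_reads/coral_filt_R1.fastq \
#     05_binning/concatenated_reads/coral_filt_R2.fastq \
#     05_binning/coral_filtered 8
```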

Conclusion and thoughts for main study


Our main concern while sampling was that we'd sample mostly eukaryotic DNA, i.e. plankton or human contamination. Fortunately this is not the case.

It also seems that pre-filtering does not drastically change the composition of the samples. We seem to have had slightly more data with the pre-filtered samples.


The sampled microbiome seems to be far more diverse than expected. A 45Gb run (1 HiSeq lane) would give us ~ 80% of our total population. It is not feasible for economic reasons to go higher at the moment, unless the NovaSeq brings a big reduction in cost per base.

Unfortunately, 45Gb per sample would also render the main study too expensive, which, combined with my next point, argues for changing our experimental design and goals for this first study.

Future experiment

It appears that the vast majority of the sampled bacteria are typical of a seawater microbiome. Vibrio, a known coral pathogen, is present in small quantities in all the samples, even baseline water. To assess coral health, we'd need to sample the mucus, which is known to be colonized by Vibrio in times of disease.

Additionally, this study would only reach an acceptable level of statistical power with a huge sequencing effort, for which we don't have the money. This is consistent with the theory that we sampled the seawater microbiome, believed to be far more diverse than the coral microbiome.

For the main study I recommend that we reduce the number of replicates and switch our focus from comparing sanctuaries and polluted sites to establishing a baseline metagenome assembly, sampling 4 sites in total: 2 in Kenya and 2 in Mauritius --> 8 samples total for soil and water.

With ultra-deep sequencing of those 8 samples, we'd stay within reasonable costs, while working on the hypothesis that soil and water host bacteria contributing to the coral ecosystem. Such an amount of sequencing would allow us to perform an assembly and try to discover genes with functions beneficial for the ecosystem.

Another future study, given additional funds, would be to sequence DNA and RNA from the mucus microbiome to characterize its function from a nutrition, protection and disease mechanism point of view in the 6 sites previously discussed, with 5 to 10 replicates.
