What is MindTheGap ?
MindTheGap performs detection and assembly of DNA insertion variants in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome. It takes as input a set of reads and a reference genome. It outputs two sets of FASTA sequences: one is the set of breakpoints of detected insertion sites, the other is the set of assembled insertions for each breakpoint.
CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
c++ compiler; compilation was tested with gcc and g++ version>=4.5 (Linux) and clang version>=4.1 (Mac OSX).
Getting the latest source code with git
# get a local copy of MindTheGap source code git clone --recursive https://github.com/GATB/MindTheGap.git # compile the code cd MindTheGap sh INSTALL # the binary file is located in directory build/bin/ ./build/bin/MindTheGap -help
Note: when updating your local repository with
git pull, if you see that thirdparty/gatb-core has changed, you have to run also :
git submodule update.
Installing a stable release
Retrieve a binary archive file from one of the official MindTheGap releases (see "Releases" tab on the Github web page); file name is
MindTheGap-vX.Y.Z-bin-Linux.tar.gz (for Linux) or
MindTheGap-vX.Y.Z-bin-Darwin.tar.gz (for MacOs).
tar -zxf MindTheGap-vX.Y.Z-bin-Darwin.tar.gz cd MindTheGap-vX.Y.Z-bin-Darwin chmod u+x bin/MindTheGap # run a simple example ./bin/MindTheGap find -in data/reads_r1.fastq,data/reads_r2.fastq -ref data/reference.fasta -out example ./bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example
In case the software does not run appropriately on your system, you should consider to install it from its source code. Retrieve the source archive file
tar -zxf MindTheGap-vX.Y.Z-Source.tar.gz cd MindTheGap-vX.Y.Z-Source sh INSTALL # the binary file is located in directory build/bin/ ./build/bin/MindTheGap -help
Pull the docker image of the latest release of MindTheGap:
docker pull clemaitr/mindthegap
MindTheGap is a software that performs integrated detection and assembly of genomic insertion variants in NGS read datasets with respect to a reference genome. It is designed to call insertions of any size, whether they are novel or duplicated, homozygous or heterozygous in the donor genome.
It takes as input a set of reads and a reference genome. It outputs two sets of FASTA sequences: one is the set of breakpoints of detected insertion sites, the other is the set of assembled insertions for each breakpoint. For each breakpoint, MindTheGap either returns a single insertion sequence (when there is no assembly ambiguity), or a set of candidate insertion sequences (due to ambiguities) or nothing at all (when the insertion is too complex to be assembled).
MindTheGap performs de novo assembly using the GATB library and inspired from algorithms from Minia. Hence, the computational resources required to run MindTheGap are significantly lower than that of other assemblers (for instance it uses less than 6GB of main memory for analyzing a full human NGS dataset).
Since version 2.0.0, MindTheGap can detect other types of variants, not only insertion events. These are homozygous SNPs and homozygous deletions of any size. They are detected by the find module of MindTheGap and are output separately in a VCF file. Importantly, even if the user is not interested in these types of variants, it is worth to detect them since it can improve the recall of the insertion event detection algorithm : it is now possible to find insertion events that are located at less than k nucleotides from an other such variant.
For more details on the method and some recent results, see the web page.
MindTheGap is composed of two main modules : breakpoint detection (find module) and the local assembly of insertions (fill module). Both steps are implemented in a single executable, MindTheGap, and can be run independently by specifying the module name as follows :
MindTheGap <module> [module options]
Basic command lines
#Find module: MindTheGap find (-in <reads.fq> | -graph <graph.h5>) -ref <reference.fa> [options] #To get help: MindTheGap find -help #Fill module: MindTheGap fill (-in <reads.fq> | -graph <graph.h5>) -bkpt <breakpoints.fa> [options] #To get help: MindTheGap fill -help
Input read data
For both modules, read dataset(s) are first indexed in a De Bruijn graph. The input format of read dataset(s) is either the read files themselves, or the already computed de bruijn graph in hdf5 format (.h5). In the first case, the option is
-inand the user can provide the de Bruijn graph building options, in the second case the option is -graph and only options for the detection or assembly are to be given.
-graphare mutually exclusive, and one of these is mandatory.
If the input is composed of several read files, they can be provided as a list of file paths separated by a comma or as a "file of file" (fof), that is a text file containing on each line the path to each read file. All read files will treated as if concatenated in a single sample. The read file format can be fasta, fastq or gzipped.
de Bruijn graph creation options
In addition to input read set(s), the de Bruijn graph creation uses two main parameters,
-kmer-size: the k-mer size [default '31']. By default, the largest kmer-size allowed is 128. To use k>128, you will need to re-compile MindTheGap with the two commands in the build directory:
cmake -DKSIZE_LIST="32 64 96 256" ..and then
make. To go back to default, replace 256 by 128. Note that increasing the range between two consecutive kmer-sizes in the list can have an impact on the size of the output h5 files (but none on the results).
-abundance-min: the minimal abundance threshold, k-mers having less than this number of occurrences are discarded from the graph [default 'auto', ie. automatically inferred from the dataset].
-abundance-max: the maximal abundance threshold, k-mers having more than this number of occurrences are discarded from the graph [default '2147483647' ie. no limit].
Find module specific options
In addition to the read or graph files, the find module has one mandatory option
-refand several optional options:
-ref: the path to the reference genome file (in fasta format).
-max-rep: maximal repeat size allowed for fuzzy sites [default '5'].
-snp-min-val: minimal number of kmers to validate a SNP [default '5']. A SNP is validated if it is validated by at least this number of consecutive overlapping kmers. This corresponds also to the minimal distance between two consecutive SNPs to be able to detect both of them.
-het-max-occ: maximal number of occurrences of a (k-1)mer in the reference genome allowed for heterozyguous insertion breakpoints [default '1']. In order to detect an heterozyguous insertion breakpoints, both flanking k-1-mers, at each side of the insertion site, must have strictly less than this number of occurrences in the reference genome. This prevents false positive predictions inside repeated regions. Warning : increasing this parameter may lead to numerous false positives (genomic approximate repeats).
-no-[type]: to disable the detection of certain types of variants.
-[type]-only: to detect only certain types of variants.
NOTE: MindTheGap can find mainly homozygous variants, except for insertion variants where it can also find heterozygous variants. Therefore -homo-only and -hete-only only applies to insertion variants.
Fill module specific options
In addition to the read or graph files, the fill module has one mandatory option
-bkptand several optional options:
-bkpt: the breakpoint file path. This is one of the output of the Find module and contains for each detected insertion site its left and right kmers from and to which the local assembly will be performed (see section E for details about the format).
-max-nodes: maximum number of nodes in contig graph [default '100']. This arguments limits the computational time, this is especially useful for complex genomes.
-max-length: maximum length of insertions (nt) [default '10000']. This arguments limits the computational time, this is especially useful for complex genomes.
All the output files are prefixed either by a default name: "MindTheGap_Expe-[date:YY:MM:DD-HH:mm]" or by a user defined prefix (option
Both MindTheGap modules generate the graph file if reads were given as input:
a graph file (
.h5). This is a binary file, to obtain information stored in it, you can use the utility program dbginfo located in your bin directory or in ext/gatb-core/bin/.
MindTheGap findgenerates the following output files:
- a breakpoint file (
.breakpoints) in fasta format. It contains the breakpoint sequences of each detected insertion site. Each insertion site corresponds to 2 consecutive entries in the fasta file : sequences are the left and right side flanking kmers.
a variant file (
.othervariants.vcf) in vcf format. It contains SNPs and deletion events.
MindTheGap fillgenerates the following output files:
- a sequence file (
.insertions.fasta) in fasta format. It contains the inserted sequences that were successfully assembled. The location of each insertion on the reference genome can be found in its fasta header, together with the insertion length and quality score.
an insertion variant file (
.insertions.vcf) in vcf format. This file resumes insertion position information already contained in the fasta file, but in a vcf format. It is not self-sufficient, inserted sequences if larger than XX bp are not written in the vcf but are referred to their fasta id in the fasta file.
Warning: the output in vcf of insertion variants is not yet implemented, coming soon...
Computational resources options
Additional options are related to computational runtime and memory:
-nb-cores: number of cores to be used for computation (graph creation and breakpoint detection) [default '0', ie. all available cores will be used].
-max-memory: max RAM memory for the graph creation (in MBytes) [default '2000']. Increasing the memory will speed up the graph creation phase.
-max-disk: max usable disk space for the graph creation (in MBytes) [default '0', ie. automatically set]. Kmers are counted by writting temporary files on the disk, to speed up the counting you can increase the usable disk space.
Details on output formats
A breakpoint file is output by MindTheGap find and is required by MindTheGap fill. This is a plain text file in fasta format, where each insertion site (or gap to fill) corresponds to two consecutive fasta entries: left kmer and right kmer. Sequences are kmer (with k being the same value used in the de bruijn graph). MindTheGap fill will try to find a path in the de bruijn graph from the left kmer to the right one.
In the breakpoint file output by MindTheGap find, one can find useful information in the fasta headers, such as the genomic position of the insertion site and its genotype (detected by the homozygous or heterozygous algorithm). A typical header is as follows:
>bkpt5_chr1_pos_39114_fuzzy_0_HOM left_kmer #bkpt5 : this is the id of the insertion event. #chr1_pos_39114 : the position of the insertion site is on chr1 at position 39114 (position just before the insertion, 1-based). #fuzzy_0 : is the size of the repeated sequence at the breakpoint (here 0 means it is a clean insertion site). #HOM : it was detected by the homozygous algorithm.
Note: in the case of a small repeat at the breakpoint site (fuzzy>0), the exact position can not be known inside the repeat, the reported position is always the right-most.
Note #2: sometimes the header can contain the word
right kmer. This concerns also repeated sequences but must not be confused with the
fuzzyfield. Fuzzy indicates if a small repeat (typically <5 bp) is exactly repeated at the breakpoint site and at an extremity of the inserted sequence (this can happen very often by chance and may have no biological meaning). Whereas "REPEATED" indicates that this insertion site is probably located in a repeated region of the reference genome, with the repeat size being >=(k-1). These breakpoints have more probability to be false positives.
VCF variant format
For both .othervariants.vcf and insertions.vcf files, the format follows the VCF specifications version 4.1 (see https://samtools.github.io/hts-specs/VCFv4.1.pdf). Positions are 1-based.
Assembled insertion format
MindTheGap fill outputs a file in fasta format containing the obtained inserted sequences. Breakpoint kmers are not included in the output sequences. For each insertion breakpoint for which the filling succeeded, one can find in this file either one or several sequences with the following header:
>bkpt5_chr1_pos_39114_fuzzy_0_HOM_len_59_qual_50 #same info as in the breakpoint file #len_59 : the length in bp of the inserted sequence, here 59 bp #qual_50 : quality of 50 (quality scores range from 0 to 50, 50 being the best quality)
If more than one sequence are assembled for a given breakpoint, the header is as follows:
>bkpt5_chr1_pos_39114_fuzzy_0_HOM_len_57_qual_15 solution 2/3 #this is the second sequence out of 3
Each insertion is assigned a quality score ranging from 0 (low quality) to 50 (highest quality). This quality score reflects mainly repeat-associated criteria:
qual=5: if one of the breakpoint kmer could not be found exactly but with 2 errors (mismatches)
qual=10: if one of the breakpoint kmer could not be found exactly but with 1 error (mismatch)
qual=15: if multiple sequences can be assembled for a given breakpoint (note that to output multiple sequences, they must differ from each other significantly, ie. <90% id)
qual=25: if one of the breakpoint kmer is repeated in the reference genome (REPEATED field in the breakpoint file)
This example can be run with the small dataset in directory
data/, for instance from the build directory:
#find bin/MindTheGap find -in ../data/reads_r1.fastq,../data/reads_r2.fastq -ref ../data/reference.fasta -out example # 3 files are generated: # example.h5 (graph), # example.othervariants.vcf (SNPs and deletion variants), # example.breakpoints (breakpoints of insertion variants). #fill bin/MindTheGap fill -graph example.h5 -bkpt example.breakpoints -out example # 2 files are generated: # example.insertions.fasta # example.insertions.vcf (not yet implemented, coming soon)
Either in your bin/ directory or in ext/gatb-core/bin/, you can find additional utility programs :
- dbginfo : to get information about a graph stored in a .h5 file
- dbgh5 : to build a graph from read set(s) and obtain a .h5 file
- h5dump : to extract all data stored in a .h5 file
MindTheGap: integrated detection and assembly of short and long insertions. Guillaume Rizk, Anaïs Gouin, Rayan Chikhi and Claire Lemaitre. Bioinformatics 2014 30(24):3451-3457. http://bioinformatics.oxfordjournals.org/content/30/24/3451
Web page with some updated results.
To contact a developer, request help, or for any feedback on MindTheGap, please use the issue form of github: https://github.com/GATB/MindTheGap/issues
If you do not have any github account, you can also send an email to claire dot lemaitre or guillaume dot rizk at inria dot fr