Adds and replaces read group tags in BAM files
Uses samtools addreplacerg to add a new read group to each BAM file in a set and assign all of its reads to that group, naming the group after the file.
This script takes as input a set of BAM files (named
*.bam) grouped in a single folder.
In each BAM file, the read group ID and SM fields are set to the BAM file name with the
.bam extension removed. The command applied to each file is:
samtools addreplacerg -r "@RG\tID:file_name\tPG:samtools addreplacerg\tSM:file_name"
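To make the naming rule concrete, here is a minimal shell sketch of how the read group line could be built from a file name and applied to every BAM file in a folder. The function names `rg_line` and `tag_folder`, the output layout (same file name in a separate folder) and the loop itself are assumptions for illustration only; they are not taken from the pipeline's own code.

```shell
#!/bin/sh
# Build the @RG line for one BAM file: ID and SM are the file name
# with its directory and ".bam" extension removed.
# The "\t" sequences are kept literal, as in the command shown above.
rg_line() {
    name=$(basename "$1" .bam)
    printf '%s\n' "@RG\tID:${name}\tPG:samtools addreplacerg\tSM:${name}"
}

# Hypothetical helper: tag every *.bam file in in_dir and write the
# result under the same file name in out_dir (assumed layout).
tag_folder() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for bam in "$in_dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the folder has no BAM files
        samtools addreplacerg -r "$(rg_line "$bam")" \
            -o "$out_dir/$(basename "$bam")" "$bam"
    done
}
```

For example, `rg_line BAM/sample1.bam` prints an @RG line whose ID and SM are both `sample1`.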
How to install
Install Java JRE if you don't already have it, then install nextflow:
curl -fsSL get.nextflow.io | bash
And move it to a location in your PATH (/usr/local/bin for example here):
sudo mv nextflow /usr/local/bin
Install samtools 1.3 or above and add it to your path. Alternatively, you can use the docker image provided (see below).
How to run
nextflow run iarcbioinfo/addreplacerg-nf --bam_folder BAM/
By default, the BAM files produced are written to the same folder as the input. To choose a different output folder, add the optional argument
--out_folder to the above command line, for example --out_folder BAM_RG.
If you don't have
samtools installed, you can use the docker image we provide, which contains it:
nextflow run iarcbioinfo/addreplacerg-nf -with-docker --bam_folder BAM/
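After a run, one way to check the result is to list the read group IDs found in a BAM header. The sketch below defines a small helper, `rg_ids` (a hypothetical name), that reads SAM header text on standard input and prints the ID of each @RG line. The output path in the usage line is also an assumption, since the exact output file names depend on the pipeline.

```shell
#!/bin/sh
# Print the ID field of every @RG line found in SAM header text on stdin.
# Header fields are tab-separated; the ID field starts with "ID:".
rg_ids() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}

# Usage (assumed output path, requires samtools):
#   samtools view -H BAM_RG/sample1.bam | rg_ids
```

If the read group was added as described, the printed ID should match the BAM file name without its .bam extension.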
The exact same pipeline can be run on your computer or on an HPC cluster by adding a nextflow configuration file that selects an appropriate executor. For example, to work on a cluster using the SGE scheduler, simply add a file named
nextflow.config in the current directory (or
~/.nextflow/config to make global changes) containing:
process.executor = 'sge'
Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation for more details. Also have a look at the other parameters for the executors, in particular
queueSize, which defines the number of tasks the executor will handle in parallel.
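As an illustration of combining both settings, the following sketch writes a nextflow.config that selects an executor and caps the number of parallel tasks. The helper name `write_executor_config` and the example values are assumptions; `process.executor` and `executor.queueSize` are standard nextflow configuration settings, but the right values depend on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: write a nextflow.config into the given directory.
#   $1 = target directory, $2 = executor name (e.g. sge, slurm),
#   $3 = maximum number of tasks handled in parallel.
# Note: this overwrites any existing nextflow.config in that directory.
write_executor_config() {
    dir=$1
    executor=$2
    queue_size=$3
    cat > "$dir/nextflow.config" <<EOF
process.executor = '$executor'
executor.queueSize = $queue_size
EOF
}

# Usage: write_executor_config . sge 50
```

Keeping the configuration in the current directory limits its effect to runs launched from there, whereas ~/.nextflow/config applies to every run.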
The default number of tasks the executor will handle in parallel is 100, which is certainly too high if you are running the pipeline on your local machine. In this case, a good idea is to set it to the number of computing cores your local machine has. The following example creates a config file with this information automatically (works on Linux and Mac OS X):
echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` > ~/.nextflow/config
Replace > with >> if you want to append the line to an existing nextflow config file instead of overwriting it.