Census 0.9.1

Census is a tool to estimate the complexity of sequencing libraries
from read count samples. See the wiki at
https://github.com/matted/census/wiki for more details, including
guides to interpreting the results of Census.

Installation:

You can get Census by cloning the repository with git:

git clone https://github.com/matted/census.git

... or by downloading a zip: https://github.com/matted/census/archive/v0.9.1.zip.

Census requires several Python packages. On an Ubuntu-like
system, these commands will install the appropriate dependencies:

sudo apt-get install python python-dev build-essential python-setuptools python-numpy python-scipy python-pylab
sudo easy_install pysam

If you don't have root permissions on your system, but you already
have Python, setuptools, gcc, Scipy, and Numpy, you can get Census
working by cloning it, moving into the new directory, and running:

python setup.py install --user

This will install the pysam dependency in your local user directory.
The Scipy and Numpy dependencies are best installed at the system
level since they require several non-Python components.

If you want the Census tools on your system path (and want to get the
pysam dependency automatically), install Census with:

sudo python setup.py install

There is also a Docker image that has Census and its dependencies
preinstalled. See https://github.com/matted/census/wiki/Docker.

Quick usage:

Census operates in two phases: a step that generates a histogram of
read duplicate counts, and an estimation step. The two scripts can be
chained with a pipe:

./bam_to_histo.py dummy.bed input.bam | ./calculate_libsize.py -

By default, Census uses paired-end information to improve the accuracy
of duplicate detection. This does not work for single-end reads, so
single-end experiments must be analyzed with the "-s" option passed to
bam_to_histo.py.
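
For scripted runs, the two-step pipeline above (including the "-s"
switch) could be wrapped as in the following Python sketch. It uses only
the flags shown in the usage text below; the function names
histo_command and run_census are hypothetical, and the sketch assumes it
is run from the Census source directory with both scripts executable.

```python
import subprocess


def histo_command(bed_path, bam_path, single_ended=False, mapq=None):
    """Build the argument list for bam_to_histo.py (flags from its usage text)."""
    cmd = ["./bam_to_histo.py"]
    if single_ended:
        cmd.append("-s")  # needed for single-end libraries
    if mapq is not None:
        cmd += ["-q", str(mapq)]  # minimum mapping quality
    cmd += [bed_path, bam_path]
    return cmd


def run_census(bed_path, bam_path, single_ended=False, mapq=None):
    """Pipe bam_to_histo.py into calculate_libsize.py and return its output text."""
    histo = subprocess.Popen(
        histo_command(bed_path, bam_path, single_ended, mapq),
        stdout=subprocess.PIPE,
    )
    estimate = subprocess.run(
        ["./calculate_libsize.py", "-"],  # "-" reads the histogram from stdin
        stdin=histo.stdout,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    histo.stdout.close()
    if histo.wait() != 0:
        raise RuntimeError("bam_to_histo.py exited with an error")
    return estimate.stdout
```

Calling run_census("dummy.bed", "input.bam", single_ended=True) would
then correspond to the shell pipeline with "-s" added.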

The reads in the input BAM file must be coordinate-sorted. The input
BED file serves a dual purpose: it lists regions that should be
filtered out during duplicate detection, and only the chromosomes that
appear in the BED file are used when counting duplicates. This allows
for quick filtering of mitochondrial reads and other sequences that do
not satisfy the same assumptions as the rest of the genome.
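
To make the chromosome rule concrete, the following Python sketch lists
the chromosome names that a BED file mentions. It only illustrates the
rule described above and is not taken from the Census code; the helper
name chromosomes_in_bed and the example intervals are made up.

```python
def chromosomes_in_bed(lines):
    """Return the set of chromosome names that appear in BED-formatted lines."""
    chroms = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the comment/header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split()[0])  # first column is the chromosome name
    return chroms


# Hypothetical BED content: chrM is deliberately left out.
example_bed = [
    "chr1\t0\t1",
    "chr2\t0\t1",
    "# no chrM entry",
]
kept = chromosomes_in_bed(example_bed)
print(sorted(kept))  # prints ['chr1', 'chr2']
```

With a BED file like this, reads on chrM would be ignored, because chrM
never appears in the first column.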

Filtering regions for hg19 described by the Pritchard lab (Pickrell et
al., Bioinformatics 2011) are included in the repository (downloaded
from http://eqtl.uchicago.edu/Masking/). For other species, see the
ENCODE filtering lists at
https://sites.google.com/site/anshulkundaje/projects/blacklists.

Extended usage and options:

Histogram generator usage:

usage: bam_to_histo.py [-h] [-v] [-s] [-q MAPQ] [-d MINDIST] [-r REGEXP]
                   excluded_regions.bed sorted_reads.bam

Histogram generator for Census library complexity package.

positional arguments:
  excluded_regions.bed
  sorted_reads.bam

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -s, --single_ended    Include only single-ended reads, instead of only
                        paired-end reads where both ends map.
  -q MAPQ, --mapq MAPQ  Minimum read mapping quality for a read or read pair
                        to be included. Default is 1.
  -d MINDIST, --mindist MINDIST
                        Maximum distance in reported flowcell coordinates for
                        reads to be considered optical duplicates. Default is
                        100.
  -r REGEXP, --regexp REGEXP
                        Regular expression for finding flowcell coordinates
                        from read names. Default is [\w\.
                        ]+:([\d]):([\d]+):([\d]+):([\d]+).*
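
As a rough illustration of the "-d" and "-r" options, the Python sketch
below pulls coordinates out of read names and compares them. Several
things here are assumptions rather than facts from the Census source:
the wrapped default above is read as one pattern with a space inside the
first character class, the four groups are taken to be lane, tile, x and
y, and the distance test is a simple per-axis comparison. The read names
and function names are hypothetical.

```python
import re

# Assumption: the help text's wrapped default, joined into a single pattern.
FLOWCELL_RE = re.compile(r"[\w\. ]+:([\d]):([\d]+):([\d]+):([\d]+).*")


def flowcell_coords(read_name):
    """Return the four captured numbers as ints, or None if the name does not match."""
    match = FLOWCELL_RE.match(read_name)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def looks_optical(name_a, name_b, max_dist=100):
    """Guess whether two reads are optical duplicates (illustrative rule only)."""
    a, b = flowcell_coords(name_a), flowcell_coords(name_b)
    if a is None or b is None:
        return False
    if a[:2] != b[:2]:  # assumed lane and tile must agree
        return False
    # Assumed metric: both coordinates within max_dist of each other.
    return abs(a[2] - b[2]) <= max_dist and abs(a[3] - b[3]) <= max_dist


print(flowcell_coords("run1:3:1101:1500:2000"))
print(looks_optical("run1:3:1101:1500:2000", "run1:3:1101:1540:2050"))
```

If read names follow a different layout, a custom pattern with the same
four capture groups can be supplied through "-r".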

Library complexity estimation usage:

usage: calculate_libsize.py [-h] [-v] [-l MINCOUNT] [-r MAXCOUNT]
                        [-s SUBSAMPLE]
                        count_histogram.txt

Census, library complexity estimation.

positional arguments:
  count_histogram.txt   File for duplicate count histogram, or - for stdin.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -l MINCOUNT, --mincount MINCOUNT
                        Minimum duplicate count to use in estimation. Default
                        is 1.
  -r MAXCOUNT, --maxcount MAXCOUNT
                        Maximum duplicate count to use in estimation. Default
                        is 10.
  -s SUBSAMPLE, --subsample SUBSAMPLE
                        Fraction of counts to use (float), useful for testing.
                        Default is 1 (no downsampling).