
KnowEnG's Gene Prioritization Pipeline

This is the Gene Prioritization Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline ranks the rows of a given spreadsheet, whose rows correspond to gene labels and whose columns correspond to sample labels. The ranking is based on correlating gene expression data (network smoothed) against phenotype data.

There are four prioritization methods that one can choose from:

| Options | Method | Parameters |
| --- | --- | --- |
| Correlation | correlation | correlation |
| Bootstrap Correlation | bootstrap sampling correlation | bootstrap_correlation |
| Correlation with network regularization | network-based correlation | net_correlation |
| Bootstrap Correlation with network regularization | bootstrapping w network correlation | bootstrap_net_correlation |

Note: all of the correlation methods listed above use either the Pearson or the t-test correlation measure.
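As a rough illustration of the plain correlation option, the sketch below scores each gene row by its Pearson correlation with a phenotype vector and sorts the genes by score. It is not the pipeline's own code: the function names, the toy data, and the choice to sort by absolute correlation are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant row or phenotype: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rank_genes(spreadsheet, phenotype):
    """Return (gene, score) pairs sorted by descending absolute correlation.

    spreadsheet: dict mapping gene label -> expression values, one per sample
    phenotype:   phenotype values in the same sample order
    Sorting by absolute value is an assumption of this sketch.
    """
    scores = {gene: pearson(row, phenotype) for gene, row in spreadsheet.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data, invented for illustration only.
toy_spreadsheet = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [4.0, 1.0, 3.0, 2.0],
    "gene_C": [2.0, 2.0, 2.0, 2.0],
}
toy_phenotype = [0.1, 0.2, 0.3, 0.4]
ranking = rank_genes(toy_spreadsheet, toy_phenotype)
```

Here `gene_A` tracks the phenotype exactly and ranks first, while the constant `gene_C` row carries no signal and ranks last.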

How to run this pipeline's docker image with Our data

1. Install Docker - follow the instructions in this link.

Install Docker Engine

2. Get the current Docker Image. On the command line with internet connection:

docker pull knowengdev/gene_prioritization_pipeline:07_26_2017

3. Create or change to a directory to hold the output data.

mkdir local_results
and / or
cd local_results

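4. Start the docker image connected to the (local_results) directory. Docker treats a -v source that is not an absolute path as a named volume, so run this from inside local_results:

docker run -v "$(pwd)":/home/test/run_dir -it knowengdev/gene_prioritization_pipeline:07_26_2017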

5. At the docker image command prompt change to the test directory.

.../home# cd test

6. Set up the environment.

.../home/test# make env_setup

7. Use one of the following "make" commands to select and run a gene prioritization option:

| Command | Option |
| --- | --- |
| make run_pearson | correlation |
| make run_bootstrap_pearson | bootstrap sampling correlation |
| make run_net_pearson | correlation with network regularization |
| make run_bootstrap_net_pearson | bootstrap correlation with network regularization |
| make run_t_test | correlation |
| make run_bootstrap_t_test | bootstrap sampling correlation |
| make run_net_t_test | correlation with network regularization |
| make run_bootstrap_net_t_test | bootstrap correlation with network regularization |

How to run this pipeline with Your data.

Perform steps 1-3 as described above, using the local_results directory.

Create a custom YAML file, then move it and your spreadsheet to the local_results directory. See the GitHub directions: Gene Prioritization ReadMe on GitHub.

  • Make sure the file named by the YAML key "spreadsheet_name_full_path" is in your local_results directory.
  • The path names inside Docker depend on how you mount the run directory in step 4, so you may have to change the YAML paths to ../../ instead of ../
  • Make sure your custom YAML file is in the local_results directory.
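One way to produce such a custom YAML file is sketched below. The keys and the numeric values come from the "run_parameters" description in this README; the file names `my_spreadsheet.tsv`, `my_phenotype.tsv`, and `my_run_file.yml`, the `./` path prefixes, and the helper functions are hypothetical and only serve the example. Adjust the paths to match how the run directory is mounted.

```python
from pathlib import Path

# Keys as documented in the "run_parameters" table of this README.
# File names and ./ prefixes are placeholders, not pipeline defaults.
run_parameters = {
    "method": "bootstrap_net_correlation",
    "correlation_measure": "pearson",
    "gg_network_name_full_path": "./STRING_experimental_gene_gene.edge",
    "spreadsheet_name_full_path": "./my_spreadsheet.tsv",
    "phenotype_name_full_path": "./my_phenotype.tsv",
    "results_directory": "./",
    "number_of_bootstraps": 5,
    "cols_sampling_fraction": 0.9,
    "rwr_max_iterations": 100,
    "rwr_convergence_tolerence": 0.01,  # spelling follows the documented key
    "rwr_restart_probability": 0.5,
    "top_beta_of_sort": 100,
}


def to_yaml_text(params):
    """Render a flat dict of scalar values as 'key: value' YAML lines."""
    return "".join(f"{key}: {value}\n" for key, value in params.items())


def write_run_file(path, params):
    """Write the parameters to a YAML run file at the given path."""
    Path(path).write_text(to_yaml_text(params))


yaml_text = to_yaml_text(run_parameters)
# To save it next to your data: write_run_file("my_run_file.yml", run_parameters)
```

Because every value is a plain scalar, simple `key: value` lines are enough here and no YAML library is needed.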

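4. Start docker with the container connected to your (local_results) directory. Run this from inside local_results so that its absolute path is mounted, and use the same image tag that was pulled in step 2:

docker run -v "$(pwd)":/home/test/run_dir -it knowengdev/gene_prioritization_pipeline:07_26_2017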

5. Change to the directory mounted in step 4.

.../home# cd ./test/run_dir
.../home/test/run_dir# ls

the files in the local_results directory will be visible in this directory

6. Run Gene Prioritization with the options in your custom YAML file.

python3 ../../src/ -run_directory ./ -run_file zTEMPLATE_GP_BENCHMARKS.yml

Description of "run_parameters" file

| Key | Value | Comments |
| --- | --- | --- |
| method | correlation or net_correlation or bootstrap_correlation or bootstrap_net_correlation | Choose gene prioritization method |
| correlation_measure | pearson or t_test | Choose correlation measure method |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4-column network file |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user-supplied gene sets |
| phenotype_name_full_path | directory+drug_response | Path and file name of user-supplied drug response file |
| results_directory | directory | Directory to save the output files |
| number_of_bootstraps | 5 | Number of random samplings |
| cols_sampling_fraction | 0.9 | Select 90% of spreadsheet columns |
| rwr_max_iterations | 100 | Maximum number of iterations without convergence in random walk with restart |
| rwr_convergence_tolerence | 0.01 | Frobenius norm tolerance of spreadsheet vector in random walk |
| rwr_restart_probability | 0.5 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo |
| top_beta_of_sort | 100 | Number of top genes selected |

gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = CCLE_Expression_ensembl.df
drug_response = CCLE_drug_ec50_cleaned_NAremoved_pearson.txt
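The three rwr_* parameters describe an iterative random walk with restart. The sketch below applies the update V_(n+1) = alpha * N * Vn + (1-alpha) * Vo until the change between iterations drops below the tolerance or the iteration limit is reached. It is a minimal pure-Python illustration, not the pipeline's implementation: the function name, the toy three-gene network, and the assumption that N is column-normalized are all choices made for this example.

```python
import math


def random_walk_with_restart(network, restart_vector, alpha=0.5,
                             max_iterations=100, tolerance=0.01):
    """Iterate V_(n+1) = alpha * N * V_n + (1 - alpha) * V_0.

    network:        square matrix N as a list of rows (assumed column-normalized)
    restart_vector: V_0, the starting values (one per gene)
    Returns (smoothed_vector, iterations_used).
    """
    current = list(restart_vector)
    for iteration in range(1, max_iterations + 1):
        updated = [
            alpha * sum(n_ij * v_j for n_ij, v_j in zip(row, current))
            + (1 - alpha) * v_0
            for row, v_0 in zip(network, restart_vector)
        ]
        # For a single vector the Frobenius norm is the Euclidean norm.
        change = math.sqrt(sum((u - c) ** 2 for u, c in zip(updated, current)))
        current = updated
        if change < tolerance:
            return current, iteration
    return current, max_iterations


# Toy path network gene_0 - gene_1 - gene_2, column-normalized by hand.
toy_network = [
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
toy_start = [1.0, 0.0, 0.0]
smoothed, iterations_used = random_walk_with_restart(toy_network, toy_start)
```

With a column-normalized network the total mass of the vector is preserved, so the smoothing only redistributes the starting signal toward network neighbors.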

Description of Output files saved in results directory

  • Any method saves separate files per phenotype with name


Genes are sorted in descending order based on visualization_score

| Response | Gene_ENSEMBL_ID | quantitative_sorting_score | visualization_score | baseline_score |
| --- | --- | --- | --- | --- |
| phenotype 1 | gene 1 | float | float | float |
| ... | ... | ... | ... | ... |
| phenotype 1 | gene n | float | float | float |

  • Any method saves sorted genes for each phenotype with name


| Ranking | phenotype 1 | phenotype 2 | ... | phenotype n |
| --- | --- | --- | --- | --- |
| 1 | gene (most significant) | gene (most significant) | ... | gene (most significant) |
| ... | ... | ... | ... | ... |
| n | gene (least significant) | gene (least significant) | ... | gene (least significant) |

  • Any method saves spreadsheet with top ranked genes per phenotype with name


| Genes | phenotype 1 | ... | phenotype n |
| --- | --- | --- | --- |
| gene 1 | 1/0 | ... | 1/0 |
| ... | ... | ... | ... |
| gene n | 1/0 | ... | 1/0 |
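
To make the relationship between the per-phenotype scores and the 1/0 spreadsheet concrete, the sketch below sorts genes in descending score order for each phenotype and marks the top entries with 1. The function name, the toy scores, and the reading of top_beta_of_sort as a simple cut-off count are assumptions made for this illustration; the pipeline's own selection logic may differ.

```python
def top_gene_indicator(scores_by_phenotype, top_beta_of_sort):
    """Build a genes x phenotypes table of 1/0 top-gene flags.

    scores_by_phenotype: dict phenotype -> dict gene -> score
                         (standing in for visualization_score)
    top_beta_of_sort:    number of top genes flagged per phenotype
    Returns dict gene -> dict phenotype -> 1 or 0.
    """
    top_sets = {}
    for phenotype, scores in scores_by_phenotype.items():
        ranked = sorted(scores, key=scores.get, reverse=True)  # descending score
        top_sets[phenotype] = set(ranked[:top_beta_of_sort])

    all_genes = sorted({g for scores in scores_by_phenotype.values() for g in scores})
    return {
        gene: {p: int(gene in top_sets[p]) for p in scores_by_phenotype}
        for gene in all_genes
    }


# Toy scores, invented for illustration only.
toy_scores = {
    "phenotype_1": {"gene_A": 0.9, "gene_B": 0.2, "gene_C": 0.5},
    "phenotype_2": {"gene_A": 0.1, "gene_B": 0.8, "gene_C": 0.7},
}
indicator = top_gene_indicator(toy_scores, top_beta_of_sort=2)
```

In this toy case `gene_C` is flagged for both phenotypes, while `gene_A` and `gene_B` are each flagged for only one.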