Last pushed: 10 days ago

KnowEnG's Data Cleanup Pipeline

This is the Data Cleanup Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.

This pipeline cleans up the data of a given spreadsheet. Given a spreadsheet, it maps gene-label row names to Ensembl-label row names and checks data formats. The checks it performs depend on the pipeline type, as listed below.

Detailed cleanup logic for each pipeline

geneset_characterization_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. is empty.
  2. contains NaN values in any column.
  3. contains only the values 0 and 1.
  4. has NaN values among its gene names.
  5. contains duplicate column names.
  6. contains duplicate row names.
  7. has gene names that can be mapped to Ensembl gene names.
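
The checks above can be sketched with pandas as follows. This is a minimal illustration, not the pipeline's actual code: the function name `check_geneset_spreadsheet` and the problem messages are hypothetical, and the Ensembl mapping step is left out because it needs the redis lookup.

```python
import pandas as pd


def check_geneset_spreadsheet(df):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: it mirrors the checks listed above, except the
    Ensembl mapping, which requires the redis gene-name lookup.
    """
    # Remove rows and columns that are entirely empty before checking.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.empty:
        return ["spreadsheet is empty"]

    problems = []
    if df.isna().any().any():
        problems.append("spreadsheet contains NaN values")
    # NaN cells are reported by the check above, so they are ignored here.
    if not (df.isin([0, 1]) | df.isna()).all().all():
        problems.append("spreadsheet contains values other than 0 and 1")
    if df.index.isna().any():
        problems.append("gene names contain NaN")
    if df.columns.duplicated().any():
        problems.append("duplicate column names")
    if df.index.duplicated().any():
        problems.append("duplicate row names")
    return problems
```

Returning a list of messages rather than stopping at the first failure is a design choice of this sketch; it lets a caller report every problem in one pass.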

samples_clustering_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. contains NaN values in any column.
  2. contains only real values (which are then replaced with their absolute values).
  3. has NaN values among its gene names.
  4. contains duplicate column names.
  5. contains duplicate row names.
  6. has gene names that can be mapped to Ensembl gene names.
  7. intersects with the gene-gene network data (network option only).

If the user provides phenotype data, then after removing empty rows and columns, check whether the phenotypic spreadsheet:

  1. contains duplicate column names.
  2. contains duplicate row names.
  3. intersects with the genomic spreadsheet.

If the user provides network data, check whether the network:

  1. is empty.
  2. intersects with the genomic spreadsheet.
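
As a rough sketch of the value and network steps above, the following takes absolute values and restricts the spreadsheet to genes present in the network. The function name `clean_for_samples_clustering` and the `network_genes` argument (a plain list of gene names) are assumptions made for illustration; the real pipeline reads the network from the file named in the run parameters.

```python
import pandas as pd


def clean_for_samples_clustering(df, network_genes=None):
    """Hypothetical helper: validate values, take absolute values, and
    optionally keep only genes that also appear in the network."""
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    if df.isna().any().any():
        raise ValueError("spreadsheet contains NaN values")
    # Every column must hold numeric (real) values.
    if not all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes):
        raise ValueError("spreadsheet contains non-numeric values")

    df = df.abs()

    if network_genes is not None:
        common = df.index.intersection(pd.Index(network_genes))
        if len(common) == 0:
            raise ValueError("no genes shared with the network")
        df = df.loc[common]
    return df
```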

gene_prioritization_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. has empty genomic or phenotypic data.
  2. contains NaN values in any column.
  3. contains only real values.
  4. contains NaN gene names.
  5. contains duplicate column names.
  6. contains duplicate row names.
  7. has gene names that can be mapped to Ensembl gene names.

After removing empty rows and columns, check the phenotypic spreadsheet:

  1. for every single drug:
    1. drop NA values.
    2. intersect its header with the spreadsheet header; the intersection must contain at least 2 entries.
  2. for t_test, it contains only the values 0, 1, or NaN.
  3. for the pearson test, it contains only real values or NaN.
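
The per-drug phenotype checks can be sketched as below. The layout is an assumption: each row of the phenotype table is taken to be one drug and each column one sample, so that its header can be intersected with the spreadsheet header. The function name `usable_drugs` is hypothetical.

```python
import pandas as pd


def usable_drugs(spreadsheet, phenotype, correlation_measure):
    """Return the drugs (phenotype rows) that pass the checks above.

    Assumed layout: phenotype rows are drugs, phenotype columns are samples,
    and spreadsheet columns are samples.
    """
    kept = []
    for drug, row in phenotype.iterrows():
        values = row.dropna()  # drop samples with no measurement
        numeric = pd.to_numeric(values, errors="coerce")
        if numeric.isna().any():
            continue  # a remaining value is not a real number
        if correlation_measure == "t_test" and not numeric.isin([0, 1]).all():
            continue  # t_test needs binary values
        # At least two samples must be shared with the spreadsheet header.
        shared = values.index.intersection(spreadsheet.columns)
        if len(shared) >= 2:
            kept.append(drug)
    return kept
```

With this sketch, a drug measured on only one sample that also appears in the spreadsheet is dropped, since a correlation cannot be computed from a single shared sample.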

pasted_gene_list

After removing empty rows and columns, the pipeline:

  1. checks whether the input genes contain NaN values.
  2. casts the index of the input genes dataframe to string type.
  3. intersects the input genes with the universal gene list from the redis database.
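
A small standard-library sketch of these three steps follows. The `universal_genes` argument is a plain Python collection standing in for the redis lookup, and the function name `clean_pasted_genes` is hypothetical.

```python
import math


def clean_pasted_genes(input_genes, universal_genes):
    """Split pasted genes into (found, not_found) against a universal list.

    `universal_genes` is an in-memory stand-in for the redis gene list.
    """
    universe = set(universal_genes)
    found, not_found = [], []
    for gene in input_genes:
        # Skip missing entries (None or NaN).
        if gene is None or (isinstance(gene, float) and math.isnan(gene)):
            continue
        name = str(gene)  # cast every label to a string before comparing
        (found if name in universe else not_found).append(name)
    return found, not_found
```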

general_clustering_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. contains NaN values in any column.
  2. contains only real values.
  3. contains NaN values in the gene names.
  4. contains NaN values in the header.
  5. contains duplicate row names.
  6. contains duplicate column names.

If the user provides phenotype data, then after removing empty rows and columns, check whether the phenotypic spreadsheet:

  1. contains duplicate column names.
  2. contains duplicate row names.
  3. intersects with the genomic spreadsheet.
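
The label checks and the phenotype intersection might look like this in pandas. Both function names are hypothetical, and the sketch assumes that the phenotype table is indexed by sample name while the genomic spreadsheet has samples in its header.

```python
import pandas as pd


def label_problems(df):
    """Report NaN labels in the row names or the header (hypothetical helper)."""
    problems = []
    if df.index.isna().any():
        problems.append("NaN in row names")
    if df.columns.isna().any():
        problems.append("NaN in header")
    return problems


def shared_samples(spreadsheet, phenotype):
    """Samples present both in the spreadsheet header and in the phenotype
    row names (assumed layout, see the note above)."""
    return list(spreadsheet.columns.intersection(phenotype.index))
```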

signature_analysis_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. contains NaN values in any column.
  2. contains only positive real values.
  3. contains NaN values in the gene names.
  4. contains NaN values in the header.
  5. contains duplicate row names.
  6. contains duplicate column names.
  7. has gene names that can be mapped to Ensembl gene names.

After removing empty rows and columns, check whether the signature data:

  1. intersects with the spreadsheet.

If the user provides network data, then for the network:

  1. find its unique genes.
  2. check that they intersect with the signature data and the spreadsheet on genes.
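
The network steps above amount to collecting the unique genes of an edge list and intersecting three gene sets. A minimal sketch, assuming the network is given as tuples whose first two fields are gene names (both function names are hypothetical):

```python
def network_genes(edges):
    """Collect the unique genes that appear at either end of any edge."""
    genes = set()
    for source, target, *_ in edges:  # extra fields such as weights are ignored
        genes.add(source)
        genes.add(target)
    return genes


def common_genes(spreadsheet_genes, signature_genes, edges=None):
    """Genes shared by the spreadsheet, the signature data and, when an
    edge list is supplied, the network."""
    common = set(spreadsheet_genes) & set(signature_genes)
    if edges is not None:
        common &= network_genes(edges)
    return sorted(common)
```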

feature_prioritization_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. contains NaN values in any column.
  2. contains only real values.

After removing empty rows and columns, check whether the phenotypic spreadsheet:

  1. for t_test, contains only the values 0, 1, or NaN.
  2. for the pearson test, contains only real values or NaN.

phenotype_prediction_pipeline

After removing empty rows and columns, check whether the spreadsheet:

  1. contains NaN values in any column.
  2. contains only real values.
  3. contains NaN values in the gene names.
  4. contains NaN values in the header.
  5. contains duplicate row names.
  6. contains duplicate column names.
  7. has gene names that can be mapped to Ensembl gene names.

After removing empty rows and columns, check whether the phenotypic spreadsheet:

  1. intersects with the spreadsheet on phenotype.
  2. for the pearson test, contains only real values or NaN.

How to run this pipeline with our data


  1. Install Docker by following the official "Install Docker Engine" instructions.

  2. Get the current Docker image. On a command line with an internet connection:

    docker pull knowengdev/data_cleanup_pipeline:07_26_2017

  3. Create or change to a directory to hold the output data.

    mkdir local_results

  and / or

    cd local_results

  4. Start the Docker image connected to the (local_results) directory.

    docker run -v local_results:/home/test/run_dir -it knowengdev/data_cleanup_pipeline:07_26_2017

  5. At the Docker image command prompt, change to the test directory.

    .../home# cd test

  6. Set up the environment.

    .../home/test# make env_setup

  7. Use one of the following "make" commands to select and run a data cleanup pipeline:

| Command | Option |
|---|---|
| make run_data_cleaning | example test |
| make run_samples_clustering_pipeline | samples clustering test |
| make run_gene_prioritization_pipeline_pearson | pearson correlation test |
| make run_gene_prioritization_pipeline_t_test | t-test correlation test |
| make run_geneset_characterization_pipeline | geneset characterization test |
| make run_pasted_gene_list | pasted gene list test |

How to run this pipeline with your data


Follow steps 1-4 above then do the following:

  1. Create your run directory

    mkdir run_directory
    
  2. Change directory to the run_directory

    cd run_directory
    
  3. Create your results directory

    mkdir results_directory
    
  4. Create a run_parameters file (YAML format)

    Look for examples of run_parameters in ./Data_Cleanup_Pipeline/data/run_files/TEMPLATE_data_cleanup.yml
    
  5. Modify the run_parameters file (YAML format)

    Set the spreadsheet and drug_response (phenotype data) file names to point to your data.
    
  6. Run the Data Cleanup Pipeline:

    • Update the PYTHONPATH environment variable

      export PYTHONPATH='../src':$PYTHONPATH    
      
    • Run

      python3 ../src/data_cleanup.py -run_directory ./ -run_file TEMPLATE_data_cleanup.yml
      

Description of "run_parameters" file


| Key | Value | Comments |
|---|---|---|
| pipeline_type | gene_prioritization_pipeline, samples_clustering_pipeline, geneset_characterization_pipeline | Choose pipeline cleaning type |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user supplied gene sets |
| phenotype_full_path | directory+phenotype_data_name | Path and file name of user supplied phenotype data |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of user supplied gene-gene network data |
| results_directory | directory | Directory to save the output files |
| redis_credential | host, password and port | Credentials to access gene names lookup |
| taxonid | 9606 | Taxon of the genes |
| source_hint | ' ' | Hint for lookup of Ensembl names |
| correlation_measure | t_test/pearson | Correlation measure to run gene_prioritization_pipeline |

spreadsheet_name_full_path = TEST_1_gene_expression.tsv
phenotype_full_path = TEST_1_phenotype.tsv
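
To illustrate how the flat keys above end up in a run file, here is a standard-library sketch that writes them as simple `key: value` YAML lines. The function name `write_run_parameters`, the `EXAMPLE_PARAMS` dictionary and the `./results_directory` value are illustrative assumptions. The nested `redis_credential` entry is deliberately left out, so the TEMPLATE_data_cleanup.yml file remains the reference for a complete run file.

```python
from pathlib import Path

# Example values only; adjust the paths to point to your own data.
EXAMPLE_PARAMS = {
    "pipeline_type": "gene_prioritization_pipeline",
    "spreadsheet_name_full_path": "TEST_1_gene_expression.tsv",
    "phenotype_full_path": "TEST_1_phenotype.tsv",
    "results_directory": "./results_directory",
    "taxonid": 9606,
    "correlation_measure": "pearson",
}


def write_run_parameters(path, params):
    """Write flat key/value pairs as 'key: value' lines.

    Only suitable for plain strings and numbers; nested entries such as
    redis_credential need a real YAML library or manual editing.
    """
    lines = [f"{key}: {value}" for key, value in params.items()]
    Path(path).write_text("\n".join(lines) + "\n")
```

Calling `write_run_parameters("my_run.yml", EXAMPLE_PARAMS)` would produce a file that can then be completed by hand with the remaining keys.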


Description of Output files saved in results directory


  • Output files

input_file_name_ETL.tsv
Input file after Extract Transform Load (cleaning)

input_file_name_MAP.tsv

| (translated gene) | (input gene name) |
|---|---|
| ENS00000012345 | abc_def_er |
| ... | ... |
| ENS00000054321 | def_org_ifi |

input_file_name_UNMAPPED.tsv

| (input gene name) | (unmapped-none) |
|---|---|
| abcd_iffe | unmapped-none |
| ... | ... |
| abdcefg_hijk | unmapped-none |
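
For downstream use, the MAP file can be loaded into a dictionary from input gene name to translated gene. This sketch assumes the file is tab-separated with the translated gene in the first column and the input name in the second, and that it has no header row; the function name `read_gene_map` is hypothetical.

```python
import csv


def read_gene_map(handle):
    """Read a two-column, tab-separated MAP file into {input_name: ensembl_id}.

    Assumes column 1 is the translated gene and column 2 the input gene name,
    with no header row.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        ensembl_id, input_name = row[0], row[1]
        mapping[input_name] = ensembl_id
    return mapping
```

A typical call would be `with open("input_file_name_MAP.tsv", newline="") as fh: mapping = read_gene_map(fh)`.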