This repository contains the implementation of the ShallowChrome modeling pipeline presented in the paper:
Frasca F., Matteucci M., Leone M., Morelli M. J. and Masseroli M. "Accurate and highly interpretable prediction of gene expression from histone modifications", 2022; 23: 151 available here.
ShallowChrome is a novel computational pipeline for accurate and fully interpretable modeling of epigenetic gene transcriptional regulation operated by Histone Mark (HM) modifications. ShallowChrome leverages the 'peak calling' procedure to retrieve gene-wise, significant and dynamically located HM features that can strongly predict the transcriptional state of genes. In our modeling pipeline we:
- Fit logistic regression models on these extracted features to solve the task of binary classification of gene transcriptional state over 56 cell-types from the REMC database;
- Analyse and rigorously interpret the obtained models by extracting insightful gene-specific regulative patterns;
- Compare the extracted patterns with the characteristic chromatin state emissions from ChromHMM (Ernst et al., 2012), showing that ShallowChrome is able to coherently rank groups of chromatin states w.r.t. their transcriptional activity.
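As a rough illustration of the first two steps, the sketch below fits a logistic regression on gene-wise features and reads its coefficients as per-mark contributions. It is not the code in notebooks/utils.py: the feature matrix is synthetic, the mark names (HM_1 ... HM_5), the 70/30 split and the use of AUC as the score are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real inputs (assumption): one row per gene,
# one column per histone mark, values loosely mimicking peak-derived signal.
rng = np.random.RandomState(0)
mark_names = ["HM_1", "HM_2", "HM_3", "HM_4", "HM_5"]  # hypothetical names
n_genes = 1000
X = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, len(mark_names)))

# Hypothetical ground truth: some marks activate, others repress.
true_weights = np.array([1.5, -1.2, 0.8, 0.3, -0.6])
logits = (X - X.mean(axis=0)).dot(true_weights)
y = (logits + rng.normal(scale=0.5, size=n_genes) > 0).astype(int)

# Hold out 30% of the genes for testing (illustrative split only).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Binary classification of gene transcriptional state (1 = on, 0 = off).
model = LogisticRegression(solver="lbfgs")
model.fit(X_train, y_train)

# Score the held-out genes and expose one coefficient per mark.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
coefficients = dict(zip(mark_names, model.coef_[0]))

print("test AUC: {:.3f}".format(auc))
for name in mark_names:
    print("{}: {:+.3f}".format(name, coefficients[name]))
```

Because the model is linear in the features, the sign and magnitude of each coefficient can be read directly as the direction and strength of the association between a mark and the "on" state, which is what makes this kind of model easy to inspect.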
Instructions for replicating the paper's results are given below.
ShallowChrome/
|-- README.md
|-- LICENSE
|-- .gitignore
|-- notebooks/
| |-- utils.py
| |-- model fitting.ipynb
| |-- model inspection.ipynb
| |-- model validation.ipynb
| |-- model fitting - valley thresholding.ipynb
| |-- data extraction.ipynb
|-- scores/
| |-- DeepChrome_scores.txt
|-- data/
| |-- - splits/
| | |-- iteration_0/
| | |-- iteration_1/
| | |-- iteration_2/
| | ...
| |-- - targets/
| | |-- E003/
| | |-- E004/
| | ...
| |-- cells.csv
| |-- gene_list.txt
| |-- GeneFile.txt
| |-- names.csv
- README.md: this file
- LICENSE: MIT license file
- .gitignore: standard .gitignore file for Python projects
- notebooks/: folder containing Python notebooks to run the modeling pipeline
- notebooks/utils.py: core Python routines called from within the notebooks to perform modeling and analyses
- notebooks/model fitting.ipynb: notebook where ShallowChrome models are fitted to solve binary classification of gene transcriptional state; reproduces Tables 2 and S1 and Figure 2 of the paper
- notebooks/model inspection.ipynb: notebook to inspect ShallowChrome models and to extract and plot gene-wise regulative patterns; reproduces Figure 3 of the paper
- notebooks/model validation.ipynb: notebook to compare ShallowChrome regulative patterns with ChromHMM chromatin state emissions; reproduces Figure 4 of the paper
- notebooks/model fitting - valley thresholding.ipynb: here the classification task is solved with an alternative approach to define target classes; reproduces Table S4 and Figures S2 and S3 of the paper
- notebooks/data extraction.ipynb: notebook to perform data extraction with pygmql and pandas
- scores/: default folder where numerical results from the modeling pipeline are stored
- scores/DeepChrome_scores.txt: test scores from the DeepChrome model (Singh et al., 2016)
- data/: default folder where data and reference files are stored
- data/- splits/: folder containing random split indices for model fitting
- data/- targets/: folder containing RPKM target values for each epigenome
- data/cells.csv: csv file enumerating the 56 epigenomes that are the object of the present study
- data/gene_list.txt: txt file containing the ordered list of genes considered in the present study
- data/GeneFile.txt: txt file containing promoter window coordinates for each of the considered genes
- data/names.csv: csv file enumerating the Histone Mark modifications considered in the present study
In order to run the ShallowChrome model fitting and analyses (notebooks model fitting.ipynb, model inspection.ipynb, model validation.ipynb, model fitting - valley thresholding.ipynb), the following libraries are required:
- matplotlib
- numpy
- scikit-learn
- scipy
- jupyter
We suggest installing them within a Python virtual environment via pip. Paper results can be reproduced with the following versions on Python 2.7.15:
matplotlib==2.2.4
numpy==1.16.6
scikit-learn==0.20.4
scipy==1.2.2
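Before running the notebooks, it can help to confirm that the active environment matches the versions listed above. The following is a minimal sketch of such a check; the helper names (installed_version, compare_versions) are hypothetical and not part of this repository, and the check relies on each package exposing a __version__ attribute.

```python
import importlib

# Versions pinned in this README for the model fitting notebooks.
# Keys are import names ("sklearn" is the import name of scikit-learn).
PINNED = {
    "matplotlib": "2.2.4",
    "numpy": "1.16.6",
    "sklearn": "0.20.4",
    "scipy": "1.2.2",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(pinned, lookup=installed_version):
    """Map each package to a (pinned, found, status) tuple.

    status is "ok" on an exact match, "differs" on a mismatch, and
    "missing" when the package cannot be imported or reports no version.
    """
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "ok"
        else:
            status = "differs"
        report[name] = (wanted, found, status)
    return report
```

Calling compare_versions(PINNED) inside the virtual environment returns one entry per package; any entry whose status is not "ok" points to a version that differs from the one used for the paper.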
In order to run the de novo data extraction notebook (data extraction.ipynb) the following libraries are required:
- numpy
- gmql
- pandas
- jupyter
We employed the following versions on Python 3.6.12:
numpy==1.16.0
gmql==0.1.1
pandas==1.1.5
NB: gmql additionally requires Java. Please follow the installation procedure here.
1. Download the pre-processed data from here;
2. Uncompress the downloaded ".zip" file in folder "data";
3. Run the notebooks/model fitting.ipynb notebook to reproduce Tables 2 and S1 and Figure 2 of the paper;
4. Run the notebooks/model inspection.ipynb notebook to reproduce Figure 3 of the paper;
5. Run the notebooks/model inspection.ipynb notebook with variable target_only set to False: this will perform model selection and fitting for all epigenomes over the 'standard' DeepChrome split, dumping all fitted models to disk;
6. Run the notebooks/model validation.ipynb notebook to reproduce Figure 4 of the paper;
7. Run the notebooks/model fitting - valley thresholding.ipynb notebook to reproduce Table S4 and Figures S2 and S3 of the paper.

Alternatively:

1. Run the notebooks/data extraction.ipynb notebook to prepare all necessary data for model fitting and analyses;
2. Go to step 3 above.
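The uncompress step above can also be scripted. The sketch below extracts an archive into the data folder and then reports which of the reference files from the repository tree are still absent. The function names are hypothetical, the archive path is whatever name the download was saved under, and the internal layout of the archive is an assumption: if it wraps its content in a top-level folder, the files will land one level deeper than this check looks.

```python
import os
import zipfile

# Reference files listed under data/ in the repository tree of this README.
EXPECTED_FILES = ["cells.csv", "gene_list.txt", "GeneFile.txt", "names.csv"]


def extract_archive(archive_path, data_dir="data"):
    """Uncompress the downloaded .zip archive into data_dir."""
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)


def missing_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names not found directly inside data_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

An empty list from missing_files("data") suggests the reference files are in place before the notebooks are run; it does not validate their content or the presence of the split and target folders.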