Hierarchical Graph Learning for Protein-Protein Interaction
HIGH-PPI runs on Python 3.7-3.9. To install all dependencies, run:
cd HIGH-PPI-main
conda env create -f environment.yml
conda activate HIGH-PPI
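Download the following wheel (.whl) files to ./file/: torch-scatter, torch-sparse, torch-cluster and torch-spline-conv. The filenames below are builds for Python 3.9 (cp39) on Linux x86_64; pick the wheels that match your own Python and PyTorch versions.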
cd ./file
pip install torch_scatter-2.0.9-cp39-cp39-linux_x86_64.whl
pip install torch_sparse-0.6.13-cp39-cp39-linux_x86_64.whl
pip install torch_cluster-1.6.0-cp39-cp39-linux_x86_64.whl
pip install torch_spline_conv-1.2.1-cp39-cp39-linux_x86_64.whl
pip install torch-geometric
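After installation, a quick import check can confirm that the packages are visible to the active environment. The sketch below is not part of HIGH-PPI: the helper name missing_modules is hypothetical, and the list uses the usual import names of the packages installed above. It only checks that each module can be located, not that a wheel is binary-compatible with the installed PyTorch.

```python
import importlib.util

# Import names of the packages installed above.
REQUIRED = [
    "torch",
    "torch_scatter",
    "torch_sparse",
    "torch_cluster",
    "torch_spline_conv",
    "torch_geometric",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required modules were found.")
```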
Three datasets (SHS27k, SHS148k and STRING) can be downloaded from Google Drive:
protein.actions.SHS27k.STRING.pro2.txt: PPI network of SHS27k
protein.SHS27k.sequences.dictionary.pro3.tsv: Protein sequences of SHS27k
protein.actions.SHS148k.STRING.txt: PPI network of SHS148k
protein.SHS148k.sequences.dictionary.tsv: Protein sequences of SHS148k
9606.protein.action.v11.0.txt: PPI network of STRING
protein.STRING_all_connected.sequences.dictionary.tsv: Protein sequences of STRING
edge_list_12: Adjacency matrix for all proteins in SHS27k
x_list: Feature matrix for all proteins in SHS27k
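To illustrate how a sequence dictionary file might be consumed, the sketch below parses tab-separated text into a Python dict. It assumes each line holds a protein ID and its amino-acid sequence separated by a tab; that layout is an assumption, not something stated here. The helper load_sequences is hypothetical, and the sample IDs and sequences are made up.

```python
import csv
import io


def load_sequences(handle):
    """Map protein ID -> sequence from a two-column, tab-separated stream.

    Assumed layout (not confirmed by the README): protein ID, TAB, sequence.
    """
    sequences = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        protein_id, sequence = row[0], row[1]
        sequences[protein_id] = sequence
    return sequences


# Made-up sample standing in for a real file.
sample = io.StringIO("P1\tMKTAYIAK\nP2\tGSHMLE\n")
sequences = load_sequences(sample)
print(len(sequences), "sequences loaded")
```

For a real file, the same function can be given a handle opened with open(path, newline="").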
Example: predicting unknown PPIs in the SHS27k dataset with native structures:
Download protein.actions.SHS27k.STRING.pro2.txt, protein.SHS27k.sequences.dictionary.pro3.tsv, edge_list_12, x_list and vec5_CTC.txt to ./HIGH-PPI-main/protein_info/.
Prepare all related PDB files. Native protein structures can be downloaded in batches from the RCSB PDB, and predicted protein structures, which carry prediction errors, can be downloaded from the AlphaFold database. Put all of the PDB files in ./protein_info/.
Generate adjacency matrix with native PDB files:
python ./protein_info/generate_adj.py --distance 12
Generate feature matrix:
python ./protein_info/generate_feat.py
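The --distance 12 option in the adjacency step suggests that residues are connected when they lie within a distance cutoff. A minimal sketch of that idea follows, assuming one 3D coordinate per residue and a cutoff in ångströms. How generate_adj.py actually selects atoms and stores the matrix is not described here, so contact_edges and the toy coordinates are purely illustrative.

```python
import math


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contact_edges(coords, cutoff=12.0):
    """Return index pairs (i, j), i < j, whose points lie within `cutoff`."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if distance(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


# Toy coordinates for four residues placed along the x axis.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (16.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = contact_edges(coords)
print(edges)  # [(0, 1), (1, 2)]
```

The pairwise loop is quadratic in the number of residues, which is fine for a sketch but may be slow for large proteins.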
To predict PPIs, use the 'model_train.py' script to train HIGH-PPI with the following options:
ppi_path: str, PPI network information
pseq_path: str, Protein sequences
p_feat_matrix: str, The feature matrix of all protein graphs
p_adj_matrix: str, The adjacency matrix of all protein graphs
split: str, Dataset split mode
save_path: str, Path for saving models, configs and results
epoch_num: int, Training epochs
python model_train.py --ppi_path ./protein_info/protein.actions.SHS27k.STRING.pro2.txt --pseq ./protein_info/protein.SHS27k.sequences.dictionary.pro3.tsv --split random --p_feat_matrix ./protein_info/x_list.pt --p_adj_matrix ./protein_info/edge_list_12.npy --save_path ./result_save --epoch_num 500
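Before launching a long training run, it can help to confirm that the input files named in the command exist. The sketch below is a hypothetical pre-flight check, not part of HIGH-PPI; the paths are copied from the training command above and the helper name missing_inputs is an assumption.

```python
from pathlib import Path

# Input paths taken from the training command above.
INPUTS = [
    "protein_info/protein.actions.SHS27k.STRING.pro2.txt",
    "protein_info/protein.SHS27k.sequences.dictionary.pro3.tsv",
    "protein_info/x_list.pt",
    "protein_info/edge_list_12.npy",
]


def missing_inputs(root, relative_paths):
    """Return the relative paths that do not exist as files under `root`."""
    root = Path(root)
    return [p for p in relative_paths if not (root / p).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains model_train.py.
    for path in missing_inputs(".", INPUTS):
        print("Missing input:", path)
```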
Run the 'model_test.py' script to test HIGH-PPI with the following options:
ppi_path: str, PPI network information
pseq_path: str, Protein sequences
p_feat_matrix: str, The feature matrix of all protein graphs
p_adj_matrix: str, The adjacency matrix of all protein graphs
model_path: str, Path for trained model
index_path: str, Path for index being tested
python model_test.py --ppi_path ./protein.actions.SHS27k.STRING.pro2.txt --pseq ./protein.SHS27k.sequences.dictionary.pro3.tsv --p_feat_matrix ./x_list.pt --p_adj_matrix ./edge_list_12.npy --model_path ./result_save/gnn_training_seed_1/gnn_model_valid_best.ckpt --index_path ./train_val_split_data/train_val_split_1.json
The output after running 'model_test.py' includes:
valid_label_list: Real PPI labels for the test index
test_pre_result_list: Predicted PPI results for the test index
best_f1: Overall performance in terms of best-F1 score
aupr: Performance in terms of AUPR score for all seven PPI types (reaction, binding, ptmod, activation, inhibition, catalysis and expression)
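For readers unfamiliar with multi-label scoring, the sketch below computes a micro-averaged F1 score over binary label vectors, one entry per PPI type. Whether model_test.py computes best_f1 in exactly this way is an assumption; micro_f1 and the toy labels are illustrative only.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over equally shaped lists of 0/1 label vectors."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        for t, p in zip(truth, pred):
            if t == 1 and p == 1:
                tp += 1  # correctly predicted interaction type
            elif t == 0 and p == 1:
                fp += 1  # predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # present but not predicted
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two toy interactions, each with seven labels (one per PPI type).
truth = [[1, 0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 1, 0]]
pred = [[1, 0, 0, 0, 0, 0, 0], [0, 1, 1, 0, 0, 1, 0]]
score = micro_f1(truth, pred)
print(round(score, 3))  # 0.75
```

Micro-averaging pools true positives, false positives and false negatives across all label types before computing the score, so frequent interaction types weigh more than rare ones.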