Skip to content

protddg-bench/protddg-bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Benchmarking resource for predictors of protein stability change.

INTRODUCTION

  Emidio Capriotti, Ludovica Montanucci and Piero Fariselli, 2023.
  Scripts are licensed under the Creative Commons by NC-SA license.

  ProtDDG-Bench is a benchmarking resource for predictors of protein stability change. 

DATASETS

  All the datasets for testing the performance of the predictors of DDG upon mutation are available at
  the protddg-bench repository. The protddg-bench repository includes the following datasets:

  1. VB1432:    1432 variants from 79 protein structures corresponding to 65 clusters.
                9 mutations have double experimental data. 1 mutation is not mapping 
                to the strcuture. The 1LRP structure was replace with 1LMB. For structure 1WQ5
                was considered the chain B. Data from PMID:29597263.
  2. S2648:     2648 variants from 132 protein strcutures corresponding to 113 clusters.
                Experimental DDGs of the same variants are avereged. Data from PMID:21569468.
  3. Ssym:      634 variants from 357 structures corresponding to 13 clusters.
                Dataset composed by 342 mutations and their reverse. From the original publication,
                the data of few mutants have been corrected.
                Original data from PMID:29718106.
  5. Broom:     Dataset composed by 605 mutations from 58 structures corresponding to 50 clusters.
                This dataset contains 53 mutations from non-native proteins and 59 mutations
                referring to fragment of the protein. Experimental data are duplicated and 
                triplicated in 17 and 1 cases. Data from PMID:28710274.

  4. Myoglobin: 134 variants from myoglobin from structure 1BZ6. Experimental data are 
                duplicated and triplicated in 14 and 3 cases respectively. The variation of                    
                unfolding free energy (DDG) was calculated changing the sign of the DDG
                reported in literarture. Data from PMID:26054434
                
  5. P53:       42 variants from P53 structure 2OCJ. Data from PMID:24281696.

  6. KORPM:     2,371 mutations from 129 protein families with sequence identity <25%
                and a testing set of 461 mutants. PMID:36629451.

  7. PTMUL:     914 multiple site variants from 91 protein structures and 77 clusters.
                PMID:31266447.


 - The directory S2648 and VB1432 contains 10 files for 10-folds cross-validation tests.
   Furthermore, the cross-validation subset of S2648 and VB1432 are consistent. This means that 
   the following predictions can be performed:

     Training: not SET_i vb1432-10fold-split-j.tsv -> Test: SET_i s2648-10fold-split-j.tsv
     Training: not SET_i s2648-10fold-split-j.tsv  -> Test: SET_i vb1432-10fold-split-j.tsv


 - The directory BROOM contains a 5-fold split of the BROOM dataset. Given the number of mutations
   mutations form the same cluster the set has been diveded in 5 subsets. 
   The test on this dataset can be performed as follow:

     Training: not SET_i train-vb1432-test-broom.tsv -> Test: SET_i broom-5fold.tsv
     Training: not SET_i train-s2648-test-broom.tsv  -> Test: SET_i broom-5fold.tsv


 - The directory SSYM contains a 5-fold split of the Ssym dataset. Given the large number of
   mutations form the same cluster the set has been diveded in 5 subsets. 
   The test on this dataset can be performed as follow:

     Training: not SET_i train-vb1432-test-ssym.tsv -> Test: SET_i ssym-5fold.tsv
     Training: not SET_i train-s2648-test-ssym.tsv  -> Test: SET_i ssym-5fold.tsv


 - The directory MYOGLOBIN test contains the testing dataset myoglogin.tsv with the 
   best subsets of VB1432 and S2648 to be used as possible training.
   The following prediction can be performed:

     Training: train-vb1432-test-myoglobin.tsv (1399) -> Test: myoglobin.tsv
     Training: train-s2648-test-myoglobin.tsv  (2607) -> Test: myoglobin.tsv


 - The directory P53 test contains the testing dataset p53.tsv with the best subsets 
   of VB1432 and S2648 to be used as possible training. 
   The following prediction can be performed:

     Training: train-vb1432-test-p53.tsv       (1427) -> Test: p53.tsv
     Training: train-s2648-test-p53.tsv        (2643) -> Test: p53.tsv


 - The directory KORPM contains 10 files for 10-folds cross-validation tests.
   Furtermore it contains 2 training and 2 testing files. The testing files are
   Ssym and S461.
   The tests on this dataset can be performed as follow:

     Training: not SET_i korpm-10fold-split-j.tsv -> Test: SET_i korpm-10fold-split-j.tsv
     Training: not Ssym  train-korpm-nossym.tsv  (1,807) -> Test: ssym-korpm.tsv
     Training: not S461  train-korpm-nos461.tsv  (2,224) -> Test: s461-korpm.tsv


 - The directory PTMUL contains files for testing predictions on multiple site mutations starting from  
   a training on a set single point mutations.
   The directory also includes a 5-fold split of the PTMUL dataset. Given the number of mutations
   mutations form the same cluster the set has been diveded in 5 subsets.
   The test on this dataset can be performed as follow:

     Training: not SET_i train-vb1432-test-ptmul.tsv -> Test: SET_i ptmul-5fold.tsv
     Training: not SET_i train-s2648-test-ptmul.tsv  -> Test: SET_i ptmul-5fold.tsv

CLUSTERING

 The file data/cluster-545-pdbchains.txt contains 132 clusters of 545 PDB chains. 
 The clustering is obtained using blastclust with the options -S 25 -L 0.5 -b F.
 
 The file data/cluster-129-korpm-pdbchain.txt contains 129 groups of proteins from the korpm dataset.
 The clustering is obtained using MMseq with 25% sequence identity cutoff.

TESTING

 To test your method you need to:
    1. replace the file scripts/predict-ddg-value.py with your own script that runs taking 
       in input only the testing and training files and returning in standard output 
       the experimental and the predicted ddgs respectively
       The program runs as follows:
            ./scripts/predict-ddg-value.py test_file.txt train_file.txt

    2. Generate an inputfile containing a two columns representing the PDB chain 
       identifier and the mutation followed by all the inputfeatures.
       The full list of mutations are reported in the file data/unique-mutations-input.txt
       and example of input file with two input features is data/ifeatures-KYTJ820101-BASU050101.txt.
 
 Finally run scripts/test.py input_feature_file.txt to score the performace of your method.
 For example runs:
       ./scripts/test.py ./data/ifeatures-KYTJ820101-BASU050101.txt

About

Scoring methods for predicting the DDG upon protein mutation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages