Bioinformatics, 2023, 39(10), btad637
[Link]
Advance Access Publication Date: 17 October 2023
Original Paper

Data and text mining

Struct2GO: protein function prediction based on graph pooling algorithm and AlphaFold2 structure information

Peishun Jiao1, Beibei Wang1, Xuan Wang1,2, Bo Liu3,4, Yadong Wang3,4, Junyi Li1,2,4,*

1School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
2Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
3Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
4Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
*Corresponding author. School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China. E-mail: lijunyi@[Link]

Associate Editor: Jonathan Wren

Received: 12 April 2023; Revised: 5 October 2023; Editorial Decision: 12 October 2023; Accepted: 16 October 2023

© The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ([Link]), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Motivation: In recent years there has been a breakthrough in protein structure prediction: the AlphaFold2 model of the DeepMind team has improved the accuracy of protein structure prediction to the atomic level. Current deep learning-based protein function prediction models usually extract features from protein sequences and combine them with protein–protein interaction networks to achieve good results. However, for newly sequenced proteins that are not in a protein–protein interaction network, such models cannot make effective predictions. To address this, this article proposes the Struct2GO model, which combines protein structure and sequence data to enhance both the precision of protein function prediction and the generality of the model.
Results: We obtain amino acid residue embeddings of the protein structure through graph representation learning, use a graph pooling algorithm based on a self-attention mechanism to obtain whole-graph structure features, and fuse them with sequence features obtained from a protein language model. The results demonstrate that the Struct2GO model achieves better results than traditional protein sequence-based function prediction models.
Availability and implementation: The data underlying this article are available at [Link]
1 Introduction

As the expression products of genes and macromolecules in organisms, proteins are the main material basis of life activities. They exist widely in various cells, provide many functions such as catalysis, cell signaling, and structural support, and play a key role in life activities and functional execution. At the same time, the study of proteins helps us grasp life activities at the molecular level, which has important practical significance for the management of diseases, the creation of new medications, and the improvement of crops. Because of the progress of high-throughput sequencing technology, protein sequence data are increasing exponentially. At present, more than 100 000 proteins in the Universal Protein (UniProt) (UniProt Consortium 2018) database have standard functional annotations obtained by biological experiments. This accounts for only 0.1% of the proteins in the UniProt database. However, verifying protein functions through biological experiments is time-consuming and labor-intensive and places strict requirements on equipment and funding, so it cannot meet the growing annotation demand; it is therefore necessary to design an efficient protein function prediction method.

The protein function prediction problem can be viewed as a multi-label binary classification problem, that is, extracting the features of a given protein and mapping them to the space of protein function labels. A variety of data sources can be tapped to obtain protein function prediction features, such as the protein sequence, protein structure, protein family, and protein–protein interaction network. The most commonly used information sources are the protein sequence and the interaction network. Protein function labels can be standardized through Gene Ontology (The Gene Ontology Consortium 2017), a database established by the Gene Ontology Consortium to define and describe genes and their products. According to different functional scopes, Gene Ontology includes three independent branches: Cellular Component, Molecular Function, and Biological Process.

Generally, the study of protein function prediction can be separated into three stages. The initial stage is the classic sequence-based method, such as BLAST (Altschul et al. 1990), which calculates the similarity between protein sequences and transfers annotations between proteins whose similarity scores exceed a certain threshold. This method has great limitations in predicting the functions of proteins without sequence similarity. The second stage is the machine learning method based on decision trees and support vector machines, of which a representative is the multi-source k-nearest neighbors (MS-kNN) (Lan et al. 2013) algorithm, which integrates
multiple similarity measurement methods to find the k nearest neighbors of the current protein; the annotation of the current protein is then determined as the weighted average of the functions of its neighboring proteins. In 2018, the DeepGO (Kulmanov et al. 2018) model proposed by Kulmanov et al. was the first application of deep learning to protein function prediction: it learns features from the protein sequence matrix through convolutional neural networks and combines them with the embedding vectors of protein nodes in the PPI network for function prediction, opening the third stage of deep learning models. The following year, the same team proposed the DeepGOPlus (Kulmanov and Hoehndorf 2020) model, which does not rely on the embedding vectors of protein nodes in the protein–protein interaction network but instead captures sequence similarity information through the DIAMOND (Buchfink et al. 2015) sequence alignment tool and combines it with a CNN that extracts sequence features, improving prediction performance. DeepGraphGO (You et al. 2021) leverages the family and domain information of the sequence to provide the nodes with initial features and then utilizes graph convolutional networks to acquire the structural information of the PPI network. Building on this, PSPGO (Wu et al. 2023) proposed a multi-species label and feature propagation model based on a protein sequence similarity network and a PPI network.

All of the above methods use the protein sequence as the information source to predict GO terms; however, sequence information alone cannot reveal the correlation between protein functions. Moreover, models that obtain homologous sequence features based on the PSSM method exhibit lower sensitivity to single amino acid substitutions (Arya et al. 2022). That structure determines function is a universal rule in nature (Dawson et al. 2017, Mitchell et al. 2019); hence, despite disparate sequences, two proteins with analogous structures may possess analogous functions (Brenner et al. 1996, Holm and Sander 1996, Krissinel 2007, Sebastian and Contreras-Moreira 2013). It is therefore imperative to create techniques that predict function from protein structural data, compensating for the disparity between protein sequence and function. DeepFRI (Gligorijevic et al. 2021) has demonstrated encouraging outcomes in the annotation of protein functions through the utilization of experimentally determined protein structure databases. Although only a limited number of proteins have experimental structures, AlphaFold2 (Jumper et al. 2021) has achieved a remarkable advance in protein structure prediction, attaining unprecedented atomic-level accuracy that in most cases is comparable to experiments, and it has publicly released 214 million predicted protein structures, including human ones, which will further promote the development of structure-based protein function prediction methods.

In this article, Struct2GO, a protein function prediction model that leverages multi-source data fusion, is proposed, as shown in Fig. 1. Specifically, the model takes protein sequence information and protein structure information as inputs, extracts sequence features through the SeqVec pre-trained model, and extracts structural features through a hierarchical graph pooling model based on the self-attention mechanism. To maximize the utilization of the protein structure information provided by AlphaFold2, residue-level embeddings are pre-trained on the protein structure network via Node2vec and then employed as the initial node features of the pooling model. Numerous experiments have demonstrated that a protein function prediction model combining structure and sequence can significantly enhance prediction accuracy. Simultaneously, the model eliminates the restrictions that the PPI network imposes on feature extraction, thereby significantly improving the model's generalizability.

Figure 1. The Struct2GO model. The model's input includes the protein structure and protein sequence. In the preprocessing stage, the protein three-dimensional structure is transformed into a protein contact graph, and amino acid-level embeddings are generated through Node2vec. At the same time, protein sequence features are extracted with SeqVec and reduced to a 1 × 1024 vector. Then, through two layers of the self-attention graph pooling model, protein structure features are extracted: a GCN aggregates neighbor information and generates node weights, a top-rank algorithm selects nodes according to the weight values and updates node features to generate subgraphs, and the feature values of the two readout layers are accumulated as the protein structural features. Finally, the sequence and structure features of the protein are fused as the input of the classifier.
2 Materials and methods
2.1 Datasets
In this experiment, we obtained human protein structure data predicted by AlphaFold2 from the EMBL-EBI database, including 23 391 protein structures. More than 560 000 records were screened from the gene ontology annotation labels corresponding to human proteins, and the annotations obtained by experiments, that is, those with evidence codes "IDA," "IPI," "EXP," "IGI," "IMP," "IEP," "IC," or "TAS," were extracted; the resulting human dataset includes 20 395 records. Concurrently, we downloaded and parsed the most recent gene ontology data released by the official Gene Ontology website, constructed the directed acyclic graph of gene ontology from the parsed terms of the BPO, CCO, and MFO branches, and completed the labels according to the true path rule. It should be noted that most functional terms do not appear in the dataset or annotate only a few proteins, so for each branch we filter out gene ontology terms whose frequency is lower than a certain threshold, reducing label sparsity. After completion, the numbers of BPO, MFO, and CCO labels are 809, 273, and 298, respectively (see Supplementary Table S3).
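To make the label-completion step concrete, the sketch below shows one way to propagate experimental annotations up the GO DAG under the true path rule and then drop rare terms. It is an illustration only, not the authors' released code: the networkx DAG with child-to-parent edges, the dictionary layout, and the function name are our assumptions.

```python
import networkx as nx
from collections import Counter

def complete_and_filter_labels(annotations, go_dag, min_count):
    """Propagate GO annotations up the DAG (true path rule), then drop rare terms.

    annotations: dict protein_id -> set of GO term ids (experimental codes only)
    go_dag:      nx.DiGraph with edges pointing from child term to parent term
    min_count:   minimum number of annotated proteins for a term to be kept
    """
    completed = {}
    for pid, terms in annotations.items():
        full = set(terms)
        for t in terms:
            if go_dag.has_node(t):
                # with child->parent edges, nodes reachable from t are its ancestors,
                # and every ancestor of an annotated term is also a true label
                full |= nx.descendants(go_dag, t)
        completed[pid] = full

    counts = Counter(t for terms in completed.values() for t in terms)
    kept = {t for t, c in counts.items() if c >= min_count}
    return {pid: terms & kept for pid, terms in completed.items()}
```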
2.2 Protein representation
2.2.1 Construction of the protein contact map
Protein structure and function are closely related. To better infer protein functions from structure information, we transform the three-dimensional protein structure into a two-dimensional protein contact map and construct a protein structure network that aggregates the information of adjacent residues, from which we finally obtain the protein structure features. In the specific implementation, we obtain the three-dimensional atomic coordinates of the protein structure from AlphaFold2 and then calculate the relative distances between amino acid residues: if the distance between the Cα atoms of two residues is less than 10 Å, an edge is considered to connect the two residues directly. We employed two distinct methods for generating contact maps, ANY-ANY and NBR; see Supplementary Fig. S1 for the relevant experimental results.
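As an illustration of this construction, the following sketch builds a binary Cα contact map from an AlphaFold2 PDB file with the 10 Å threshold described above. It assumes Biopython and NumPy are available; the function name and the single-chain assumption are ours, not the paper's.

```python
import numpy as np
from Bio.PDB import PDBParser

def contact_map_from_pdb(pdb_path, threshold=10.0):
    """Build a binary residue contact map from an AlphaFold2 PDB file.

    Two residues are connected when their C-alpha atoms lie closer
    than `threshold` angstroms (10 A in the paper).
    """
    structure = PDBParser(QUIET=True).get_structure("af2", pdb_path)
    # AlphaFold2 models typically contain a single chain in a single model
    ca_coords = np.array([res["CA"].coord
                          for res in structure.get_residues() if "CA" in res])
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))           # pairwise C-alpha distances
    contact = (dist < threshold).astype(np.int8)  # adjacency matrix of the graph
    np.fill_diagonal(contact, 0)                  # no self-loops
    return contact
```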
2.2.2 Obtaining amino acid residue-level features
In the protein structure network, each node is an amino acid residue. The most intuitive way to obtain node features is the one-hot encoding of the 20 different amino acids, but this encoding cannot capture the position information of the same amino acid in different protein networks. Therefore, we utilize graph representation learning to acquire the structural information of each node in the protein network. Among current algorithms, DeepWalk (Perozzi et al. 2014) is one of the most representative; it extends the word2vec (Mikolov et al. 2013) idea and assumes that neighboring nodes have analogous embedding vectors. Node2vec (Grover and Leskovec 2016) instead acquires the successive vertices through a biased random walk: given the current vertex v, the likelihood of visiting the subsequent vertex x is

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{p_{vx}}{Z} & \text{if } (v,x) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where p_{vx} is the unnormalized transition probability between vertex v and vertex x, and Z is the normalization constant. Node2vec introduces two hyperparameters p and q to regulate the random walk strategy. Assuming that the current random walk passes through edge (t, v) to reach vertex v, let p_{vx} = α_{pq}(t, x) · w_{vx}, where w_{vx} is the edge weight between vertices v and x:

$$\alpha_{pq}(t,x) = \begin{cases} \dfrac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \dfrac{1}{q} & \text{if } d_{tx} = 2 \end{cases} \qquad (2)$$

where d_{tx} is the shortest path distance between vertex t and vertex x.

In the specific implementation, this experiment is based on Tencent's open-source distributed machine learning platform Spark-On-Angel and uses the efficient data storage, update, and sharing services provided by Spark to run the node2vec algorithm on the graph. In our input proteins, the number of amino acid residues is below 1500, so we choose a walk length of 30 in node2vec, with p = 0.8 and q = 1.2; combined with the one-hot encoding of each residue, we finally generate a 1 × 50 dimensional feature vector for each residue in the protein.
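The actual walks are generated on Spark-On-Angel, but the sampling rule of Equations (1) and (2) can be sketched on a single machine as follows. The adjacency-list and weight-dictionary layout is hypothetical; the paper's settings (walk length 30, p = 0.8, q = 1.2) are used as defaults.

```python
import random

def biased_walk(adj, weights, start, walk_len=30, p=0.8, q=1.2):
    """One node2vec-style biased random walk.

    adj:     dict node -> list of neighbour nodes
    weights: dict (u, v) -> edge weight w_uv (defaults to 1.0)
    """
    walk = [start, random.choice(adj[start])]
    while len(walk) < walk_len:
        t, v = walk[-2], walk[-1]
        probs = []
        for x in adj[v]:
            if x == t:             # d_tx = 0: step back to the previous node
                alpha = 1.0 / p
            elif x in adj[t]:      # d_tx = 1: x is a neighbour of both t and v
                alpha = 1.0
            else:                  # d_tx = 2: x moves away from t
                alpha = 1.0 / q
            probs.append(alpha * weights.get((v, x), 1.0))
        # roulette-wheel sampling of the next vertex (normalization by Z)
        r, acc = random.random() * sum(probs), 0.0
        for x, pr in zip(adj[v], probs):
            acc += pr
            if acc >= r:
                walk.append(x)
                break
    return walk
```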
2.2.3 Extraction of protein sequence features
In the natural language domain, pre-trained models such as BERT (Devlin et al. 2018) and XLNet (Yang et al. 2019) have advanced rapidly in recent years, and many researchers have extended NLP models to the bio-sequence field, proposing a variety of pre-trained models for obtaining distributed representations of protein sequences; among them, the SeqVec model (Heinzinger et al. 2019) is widely employed. The SeqVec pre-trained model can extract function-related semantic information from the sequence and has achieved good results in tasks such as protein subcellular localization, secondary structure prediction, and function prediction. Specifically, the SeqVec model uses the CharCNN (Zhang et al. 2015) algorithm to acquire local characteristics of amino acids and then uses a BiLSTM to construct the language model. The feature of a single amino acid is obtained by combining the character features with the language model layers; that is, for the kth amino acid, its representation is

$$\mathrm{SeqVec}_k = x_k^{LM} + h_{k,1}^{LM} + h_{k,2}^{LM} \qquad (3)$$

$$h_{k,j}^{LM} = \left[\,\overrightarrow{h}_{k,j}^{LM};\ \overleftarrow{h}_{k,j}^{LM}\,\right] \qquad (4)$$

where x_k^{LM} is the 1024-dimensional character feature output by the CharCNN layer, and →h_{k,j}^{LM} and ←h_{k,j}^{LM} are the 512-dimensional vectors output by the forward and backward directions of the LSTM layers, respectively. These two output vectors are concatenated to form the 1024-dimensional feature h_{k,j}^{LM} output by the jth layer of the BiLSTM model. Finally, the SeqVec model concatenates the residue-level features into a 1024 × N matrix and reduces its dimensionality through principal component analysis or average aggregation to generate a 1 × 1024 vector.

In the specific implementation, this experiment uses the SeqVec model, which is first pre-trained on about 33M sequences from the UniRef50 database. The human protein sequences are then taken as input, and for each protein sequence we obtain a feature vector as the protein sequence feature, which is combined with the structural features in the subsequent model for downstream protein function prediction.
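For illustration, the per-protein feature of Equations (3) and (4) can be computed with the publicly released SeqVec (ELMo-style) embedder roughly as follows; the weight-file paths are placeholders, average aggregation is used for the final reduction, and this is a sketch of typical SeqVec usage rather than the authors' exact pipeline.

```python
from allennlp.commands.elmo import ElmoEmbedder  # SeqVec is a retrained ELMo

# options/weights files come from the SeqVec release (paths are placeholders)
embedder = ElmoEmbedder("options.json", "weights.hdf5")

def seqvec_feature(sequence):
    """Per-protein 1 x 1024 feature: sum the CharCNN layer and the two BiLSTM
    layers as in Equation (3), then average over residues."""
    layers = embedder.embed_sentence(list(sequence))  # shape: (3, L, 1024)
    per_residue = layers.sum(axis=0)                  # x^LM + h^LM_1 + h^LM_2
    return per_residue.mean(axis=0)                   # average aggregation -> (1024,)
```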
2.3 Model and implementation
Since the same protein may have multiple functions, the model essentially performs a multi-label classification task. In this article, an attention-based graph pooling mechanism is adopted: it takes the protein contact graph and amino acid residue features obtained above as input, extracts protein structural features through graph convolution and hierarchical pooling, and integrates the sequence features described above as the input of the downstream multi-label protein function classifier. At the same time, the network layer and post-processing layer in the classifier enforce the hierarchical relationships between GO labels.

2.3.1 Convolution layer
In this stage, we take the protein contact graph as the adjacency matrix and the amino acid residue features as the node features of the graph, and propagate features between structurally adjacent residues through graph convolution. We explored several widely used graph convolution functions, including the Kipf and Welling graph convolutional layer (GraphConv) (Kipf and Welling 2016), Chebyshev spectral graph convolutions (ChebConv) (Defferrard et al. 2016), sample-and-aggregate convolutions (SAGEConv) (Hamilton et al. 2017), and Graph Attention (GAT) (Velickovic et al. 2017). We compared the effects of the different graph convolution methods on the results, and the experimental findings revealed that the two-layer GraphConv model performed best. In each layer, a new hidden representation is obtained through neighbor message propagation and aggregation:

$$h^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{(l)} H\right) \qquad (5)$$

where h^{(l)} is the representation of the l-th layer nodes, H ∈ R^{F×F'} is the learnable convolutional weight matrix, Ã ∈ R^{N×N} is the adjacency matrix with self-connections, and D̃ ∈ R^{N×N} is the degree matrix of Ã.
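A minimal dense-matrix sketch of the propagation rule in Equation (5) is shown below. In practice the paper relies on library implementations of GraphConv and its variants; the ReLU nonlinearity here is an assumption for σ.

```python
import torch

def graph_conv(h, adj, weight):
    """One GraphConv step as in Equation (5): symmetric normalisation of the
    self-connected adjacency, neighbour aggregation, then a nonlinearity.

    h:      (N, F) node features     adj:    (N, N) binary contact map
    weight: (F, F') learnable matrix
    """
    a_tilde = adj + torch.eye(adj.size(0))     # add self-connections (A tilde)
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)  # D tilde^{-1/2} from node degrees
    a_norm = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    return torch.relu(a_norm @ h @ weight)
```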
similarity and dynamic programming to predict gene labels.
2.3.2 Self-attention graph hierarchical pooling layer DeepGO leverages both protein sequence information and
In recent years, the self-attention mechanism has been exten- PPI network data to infer gene ontology tags. DeepGOA in-
sively employed in deep learning models, resulting in notewor- novatively introduces GCN to obtain knowledge guidance
thy outcomes, and allowing the model to focus more on prediction in GO, DeepFRI transforms protein three-
significant features. Lee et al. (2019) introduced the self- dimensional structure into a contact map and uses GCN to
attention method to the graph pooling model, and obtained im- extract structural features for protein function prediction, and
portance scores of each node by stacking convolutional layers GAT-GO changes the aggregation function GCN to GAT
and transforming the output features into one-dimensional. based on DeepFRI and verifies it through experiments.
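Equations (6) and (7) amount to a top-k selection followed by score gating and adjacency slicing, which can be sketched as follows (the dense tensors and the function name are our assumptions).

```python
import math
import torch

def sag_pool(x, adj, scores, ratio=0.75):
    """Self-attention pooling step of Equations (6)-(7): keep the ceil(k*N)
    highest-scoring nodes, gate their features by the scores, and slice the
    adjacency matrix down to the induced subgraph.

    x: (N, F) features, adj: (N, N) adjacency, scores: (N,) attention scores
    """
    n_keep = math.ceil(ratio * x.size(0))
    idx = torch.topk(scores, n_keep).indices   # top-rank node selection
    z_mask = scores[idx].unsqueeze(-1)         # importance of the kept nodes
    x_out = x[idx] * z_mask                    # X_out = X' (.) Z_mask
    adj_out = adj[idx][:, idx]                 # A_out = A_idx,idx
    return x_out, adj_out
```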
2.3.3 Readout layer
Xu et al. (2018) proved that in the field of graph classification, sum-pooling shows better results than mean-pooling and max-pooling: by summing up all node features in the graph, it retains more information. In our hierarchical pooling model, we extract the graph features of each layer by splicing sum-pooling and max-pooling and, finally, sum the graph features of the multiple layers as the structural features of the protein. The formula for each layer's graph pooling is as follows:

$$S = \frac{1}{N}\sum_{i=1}^{N} X_i \;\Big\Vert\; \max_{i} X_i \qquad (8)$$

where N is the number of nodes in the layer, X_i represents the feature of the ith node, and ∥ denotes feature splicing (concatenation).
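The per-layer readout of Equation (8) is simply a concatenation of pooled node features; a one-line sketch:

```python
import torch

def layer_readout(x):
    """Readout of Equation (8) for one pooling layer: concatenate the
    mean- and max-pooled node features. x: (N, F) -> (2F,)."""
    return torch.cat([x.mean(dim=0), x.max(dim=0).values])
```

Summing this readout over the two pooling layers yields the protein structure feature that is fused with the SeqVec sequence feature.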
3 Experiment and results
3.1 Experiment
To validate the effectiveness of Struct2GO, we divided the human protein dataset into training, validation, and test sets in a ratio of [Link] and conducted experiments with three different prediction models. We compared the predicted results on the test set with those of the current mainstream models, including Naïve, BLAST, DeepGO, DeepGOA, DeepFRI, and GAT-GO. The Naïve algorithm annotates GO terms according to their frequency, and BLAST is a protein sequence comparison technique that utilizes sequence similarity and dynamic programming to predict gene labels. DeepGO leverages both protein sequence information and PPI network data to infer gene ontology tags. DeepGOA innovatively introduces a GCN to obtain knowledge guidance from GO for prediction, DeepFRI transforms the protein three-dimensional structure into a contact map and uses a GCN to extract structural features for protein function prediction, and GAT-GO replaces DeepFRI's GCN aggregation function with GAT and verifies the change experimentally.

In this article, AUC, AUPR, and Fmax are selected as metrics to evaluate the accuracy of protein function prediction from different perspectives (the definitions of the specific parameters and formulas can be found in the Supplementary Data).
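As a rough reference, a simplified CAFA-style protein-centric Fmax can be computed as below; the 0.01 threshold grid and the handling of proteins without predictions are our simplifications, not the paper's exact protocol.

```python
import numpy as np

def fmax(y_true, y_score):
    """Simplified protein-centric Fmax over decision thresholds.

    y_true:  (n_proteins, n_terms) binary label matrix
    y_score: (n_proteins, n_terms) predicted scores in [0, 1]
    """
    best = 0.0
    for t in np.arange(0.01, 1.0, 0.01):
        pred = y_score >= t
        covered = pred.sum(axis=1) > 0            # proteins with >= 1 prediction
        if not covered.any():
            continue
        tp = (pred & (y_true > 0)).sum(axis=1)    # true positives per protein
        prec = (tp[covered] / pred[covered].sum(axis=1)).mean()
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```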
From Table 1, it is observable that our model achieves a considerable enhancement in multiple metrics compared with other prevalent models, which can be attributed to our processing of the protein dataset and our model design. We fully mined the protein structure information provided by AlphaFold2 and combined it with the sequence feature method to achieve good results in protein function prediction. At the same time, we also see that across branches the MFO branch has good prediction results while the BPO branch has lower accuracy, which may be related to the number of labels in the different branches. For a fair evaluation of the model, metrics for all GO labels are provided and a histogram is plotted, as shown in Supplementary Fig. S2. The metrics on the training and test sets can be seen in Supplementary Table S4.

3.2 Ablation study
We then performed ablation experiments to assess the impact of each component of the Struct2GO model on the enhancement of performance, as shown in Table 2. The experiments involved extracting protein semantic features from individual sequences using the SeqVec pre-trained model, obtaining contact maps based on AlphaFold2's atomic-level protein three-dimensional coordinates, and extracting protein structural features through hierarchical graph pooling. The experimental data show that the removal of any component leads to a loss of model performance, which demonstrates that all components of our model are effective. The ablation experiments reveal that, compared to protein semantic features obtained solely from single sequences, protein structural features have a significant impact on downstream function prediction tasks. Analogous to the findings of Arya et al. (2022), structure-based features are more effective in capturing amino acid mutations. Furthermore, our ablation results also indirectly support the conclusions drawn by Arya et al. (2022).

3.3 Model analysis
We compare different variants of each component of the model and verify through experiments that our model achieves the best results among the variants, as shown in Table 3. When extracting structural features from the protein contact graph, we use four different aggregation functions: GraphConv, ChebConv, GATConv, and SAGEConv. The experimental data reveal that the various aggregation functions have a minimal influence on model performance, but GraphConv often achieves better results across the data. Next, we compare the effects of SumPool, AvgPool, and MaxPool on model performance during graph readout. As Xu et al. (2018) stated, the SumPool method can accumulate more features and often achieves better results in tasks that distinguish graph structures. The Struct2GO model demonstrates that hierarchical graph pooling is more effective than global graph pooling, likely owing to its capacity to efficiently extract pertinent information from protein contact graphs with a large number of nodes. Finally, we contrasted the outcomes of single-layer and double-layer self-attention when utilizing hierarchical pooling. The experimental results show that multi-head attention layers can learn more effective information and often perform better.

3.4 Parameter sensitivity analysis
We then examined the effects of parameters such as the dropout, learning rate, pooling ratio, and number of convolution layers on the model. We employed the control variable method, varying a single parameter at a time across multiple comparison experiments, and evaluated the actual impact of each parameter by observing the performance of the model after training, so as to identify the optimal parameter value. The scope of the hyperparameter comparison experiments is presented in Table 4.
Table 1. Experimental results on human protein data.

Model          BPO                     CCO                     MFO
               Fmax    AUC     AUPR    Fmax    AUC     AUPR    Fmax    AUC     AUPR
Naïve          0.347   0.501   0.568   0.571   0.477   0.372   0.336   0.498   0.532
BLAST          0.339   0.577   0.489   0.441   0.563   0.269   0.411   0.623   0.461
DeepGO         0.327   0.639   0.571   0.589   0.695   0.448   0.404   0.760   0.625
DeepGOA        0.385   0.698   0.622   0.629   0.757   0.500   0.477   0.820   0.710
DeepFRI        0.425   0.732   0.635   0.624   0.779   0.641   0.542   0.881   0.763
GAT-GO         0.462   0.586   0.512   0.647   0.831   0.681   0.633   0.912   0.776
Struct2GO      0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Table 2. Ablation experiment results on human protein data.

Methods            BPO                     CCO                     MFO
                   Fmax    AUC     AUPR    Fmax    AUC     AUPR    Fmax    AUC     AUPR
Without structure  0.361   0.788   0.427   0.544   0.886   0.680   0.422   0.863   0.634
Without one-hot    0.438   0.854   0.609   0.625   0.934   0.727   0.648   0.947   0.752
Without Node2vec   0.430   0.850   0.602   0.584   0.925   0.714   0.636   0.946   0.694
Without sequence   0.429   0.851   0.595   0.579   0.924   0.705   0.594   0.945   0.707
Struct2GO          0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Table 3. Model comparison experiment results on the human protein dataset.

Methods                       BPO                     CCO                     MFO
                              Fmax    AUC     AUPR    Fmax    AUC     AUPR    Fmax    AUC     AUPR
Struct2GO-GraphConv           0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Struct2GO-ChebConv            0.465   0.868   0.745   0.637   0.938   0.719   0.665   0.952   0.695
Struct2GO-GATConv             0.457   0.869   0.749   0.623   0.931   0.703   0.678   0.953   0.705
Struct2GO-SAGEConv            0.471   0.868   0.735   0.642   0.937   0.713   0.683   0.955   0.702
Struct2GO-SumPool             0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Struct2GO-AvgPool             0.358   0.786   0.627   0.544   0.890   0.686   0.404   0.838   0.503
Struct2GO-MaxPool             0.457   0.864   0.731   0.633   0.936   0.720   0.667   0.953   0.696
Struct2GO-Hierarchical        0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Struct2GO-Global              0.364   0.789   0.613   0.542   0.890   0.683   0.402   0.838   0.601
Struct2GO-2_layer_attention   0.481   0.873   0.661   0.658   0.942   0.763   0.701   0.969   0.796
Struct2GO-1_layer_attention   0.456   0.867   0.745   0.629   0.935   0.728   0.634   0.942   0.672
Table 4. Range of the hyperparameter comparison experiments.

Hyperparameter   Range
Dropout          0.3, 0.35, 0.4, 0.45, 0.5
Learning rate    0.1, 0.01, 0.001, 0.0001
Pooling ratio    0.25, 0.5, 0.75
Conv number      1, 2, 3, 4

The utilization of dropout in deep neural networks can mitigate overfitting and enhance generalization capacity. From the experiments in Fig. 2, it is evident that varying the dropout has a negligible effect on model performance. Among the values tested, a dropout of 0.3 achieves slightly better performance. Simultaneously, to expedite the convergence of the model, we opted for the more prudent dropout value of 0.3.

Figure 2. PR curves of Struct2GO with different dropout values. The curves of different colors represent the influence of different dropout values on the performance of the model. The PR curves show that the model achieves the best performance and stability when the dropout value is 0.3.

From the experimental data depicted in Supplementary Fig. S4, we observed that the model's classification performance was weakest when the learning rate was 0.01, indicating that an excessively high learning rate causes the loss function to fluctuate. As the learning rate decreased, the model's convergence gradually improved, but this also necessitated a greater number of training cycles to reach the optimal value. Ultimately, taking into account both the number of training cycles and the model performance, we set the learning rate to 0.0001.

From the experimental data depicted in Supplementary Fig. S5, a convolution number of 1 means that we can only learn the features of direct neighbors. As the convolution number increases, the nodes in the graph can learn more features of indirect neighbors, but this also leads to overfitting. The experimental data reveal that the performance for the various convolution numbers differs only slightly, and the model achieves optimal performance when the convolution number is 2.

The pooling ratio represents the ratio of the number of nodes in the subgraph generated at the next layer of the hierarchical process to that of the original graph. In Supplementary Fig. S6, a pooling ratio of 1 degenerates to global pooling. Comparing the experimental results, we find that the model performs better when the pooling ratio is 0.75. When the pooling ratio is reduced to 0.25, model performance decreases significantly, which may be because reducing the number of nodes in the subgraph harms the generalization ability of the model; we therefore set the pooling ratio to 0.75.

4 Conclusion
In this article, we propose a powerful end-to-end graph deep learning model, Struct2GO, which can effectively and quickly annotate protein functions based on protein structure and sequence. Specifically, we adopt a graph pooling model to acquire structural features from the three-dimensional protein structures predicted by AlphaFold2 and integrate the sequence features extracted by SeqVec to train the protein function classifier. The three-dimensional protein structure data predicted by AlphaFold2 provide strong support for our function prediction, enabling us to abandon the constraints of PPI networks in previous works and effectively improve the generality of the model. At the same time, compared with previous methods that predict protein function based on experimentally determined protein structures, AlphaFold2 provides abundant high-resolution structure information, which enables our model to perceive more homologous information and effectively improve prediction accuracy. The comparative experiments demonstrate that Struct2GO attains state-of-the-art performance, conclusively demonstrating the effective support of structural information for protein function prediction.
In our future work, we will continue to investigate novel methods and enhance the generality and precision of the Struct2GO model. In addition, the AlphaFold2 website provides 217 million protein structures across multiple species, which can be used in future research to attempt large-scale cross-species protein function model training and further improve the generality of the model.

At the same time, to focus more on the influence of subtle structural changes on protein function prediction in future work, we can explore new approaches to protein structure feature extraction. For instance, we can investigate embedding the amino acid features extracted from sequence models into protein structure networks and explore novel random walk models to more comprehensively unearth valuable information within protein structures. In addition, we can also build a protein network based on structural similarity, with a single protein as each node, and use the effective information of homologous proteins in network propagation to improve the accuracy of the model's predictions.

Supplementary data
Supplementary data are available at Bioinformatics online.

Conflict of interest
None declared.

Funding
This work was supported by grants from the National Key Research and Development Program of China [2021YFA0910700]; the Shenzhen Science and Technology University Stable Support Program [GXWD20201230155427003-20200821222112001]; the Shenzhen Science and Technology Program [JCYJ20200109113201726]; the Guangdong Basic and Applied Basic Research Foundation [2021A1515012461, 2021A1515220115]; and the Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies [2022B1212010005].

References
Altschul SF, Gish W, Miller W et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
Arya A, Mary Varghese D, Kumar Verma A et al. Inadequacy of evolutionary profiles vis-a-vis single sequences in predicting transient DNA-binding sites in proteins. J Mol Biol 2022;434:167640.
Brenner SE, Chothia C, Hubbard TJP et al. Understanding protein structure: using SCOP for fold interpretation. Methods Enzymol 1996;266:635–43.
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015;12:59–60.
Cangea C, Velickovic P, Jovanovic N et al. Towards sparse hierarchical graph classifiers. arXiv, arXiv:1811.01287, 2018, preprint: not peer reviewed.
Dawson NL, Lewis TE, Das S et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res 2017;45:D289–95.
Defferrard M, Bresson X, Vandergheynst P. Convolutional neural networks on graphs with fast localized spectral filtering. Adv Neural Inform Process Syst 2016;29:3844–52.
Devlin J, Chang MW, Lee K et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv, arXiv:1810.04805, 2018, preprint: not peer reviewed.
Gligorijevic V, Renfrew PD, Kosciolek T et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168.
Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA: Association for Computing Machinery, 2016, 855–64.
Hamilton W, Ying Z, Leskovec J. Inductive representation learning on large graphs. Adv Neural Inform Process Syst 2017;30:1024–34.
Heinzinger M, Elnaggar A, Wang Y et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019;20:723.
Holm L, Sander C. Mapping the protein universe. Science 1996;273:595–603.
Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv, arXiv:1609.02907, 2016, preprint: not peer reviewed.
Krissinel E. On the relationship between sequence and structure similarities in proteomics. Bioinformatics 2007;23:717–23.
Kulmanov M, Hoehndorf R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 2020;36:422–9.
Kulmanov M, Khan MA, Hoehndorf R et al. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics 2018;34:660–8.
Lan L, Djuric N, Guo Y et al. MS-kNN: protein function prediction by integrating multiple data sources. BMC Bioinformatics 2013;14(Suppl. 3):S8.
Lee J, Lee I, Kang J. Self-attention graph pooling. In: Proceedings of the 36th International Conference on Machine Learning. Long Beach, CA: PMLR, 2019, 3734–43.
Mikolov T, Chen K, Corrado G et al. Efficient estimation of word representations in vector space. arXiv, arXiv:1301.3781, 2013, preprint: not peer reviewed.
Mitchell AL, Attwood TK, Babbitt PC et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 2019;47:D351–60.
Perozzi B, Al-Rfou R, Skiena S. DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2014, 701–10.
Sebastian A, Contreras-Moreira B. The twilight zone of cis element alignments. Nucleic Acids Res 2013;41:1438–49.
The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res 2017;45:D331–8.
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2018;47:D506–15.
Velickovic P, Cucurull G, Casanova A et al. Graph attention networks. arXiv, arXiv:1710.10903, 2017, preprint: not peer reviewed.
Wu K, Wang L, Liu B et al. PSPGO: cross-species heterogeneous network propagation for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform 2023;20:1713–24.
Xu K, Hu W, Leskovec J et al. How powerful are graph neural networks? arXiv, arXiv:1810.00826, 2018, preprint: not peer reviewed.
Yang Z, Dai Z, Yang Y et al. XLNet: generalized autoregressive pretraining for language understanding. Adv Neural Inform Process Syst 2019;32:5753–63.
You R, Yao S, Mamitsuka H et al. DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction. Bioinformatics 2021;37:i262–71.
Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1. Montreal, Canada: MIT Press, 2015, 649–57.