2022 07 21 500999v1 Full
Authors: Ruidong Wua,1, Fan Dinga,1, Rui Wanga,1, Rui Shena,1, Xiwen Zhanga, Shitong Luoa,
Chenpeng Sua, Zuofan Wua, Qi Xieb, Bonnie Bergerc,2, Jianzhu Maa,2, Jian Penga,2
Affiliations: aHelixon US Inc, USA; bWestlake Laboratory of Life Sciences and Biomedicine,
Hangzhou, Zhejiang, China; cComputer Science & Artificial Intelligence Laboratory,
Massachusetts Institute of Technology, Cambridge, MA 02139
Abstract: Recent breakthroughs have used deep learning to exploit evolutionary information in
multiple sequence alignments (MSAs) to accurately predict protein structures. However, MSAs of
homologous proteins are not always available, such as with orphan proteins or fast-evolving
proteins like antibodies, and a protein typically folds in a natural setting from its primary amino
acid sequence into its three-dimensional structure, suggesting that evolutionary information and
MSAs should not be necessary to predict a protein’s folded form. Here, we introduce OmegaFold,
the first computational method to successfully predict high-resolution protein structure from a
single primary sequence alone. Using a new combination of a protein language model that allows
us to make predictions from single sequences and a geometry-inspired transformer model trained
on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction
accuracy to AlphaFold2 on recently released structures. OmegaFold enables accurate predictions
on orphan proteins that do not belong to any functionally characterized protein family and
antibodies that tend to have noisy MSAs due to fast evolution. Our study fills a much-encountered
gap in structure prediction and brings us a step closer to understanding protein folding in nature.
1 Equal Contribution
2 Correspondence
Main Text:
Half a century after Anfinsen demonstrated a connection between a protein’s amino acid sequence
and folded three-dimensional conformation, scientists have built deep learning models that finally
can predict high-resolution protein structures. The recent success of DeepMind’s AlphaFold2 (1)
and RoseTTAFold (2) in protein structure prediction has mainly been based on advances in deep
learning, especially transformer-based models (3), and the accumulation of large databases of
protein sequences and structures that enable effective training of large models. Both these methods
need as input evolutionary data in the form of multiple sequence alignments (MSAs) of
homologous sequences aligned to the primary one, a technique which has been a staple of structure
prediction methods (1, 2, 4–15). By extracting residue-residue covariances from these MSAs, these
algorithms have been shown to greatly outperform previous approaches, including physics-based
models, homology-based methods, and convolutional neural networks (16), and to predict structures
with atomic-level accuracy for the first time in history. Furthermore, many methods have since
built upon AlphaFold2 for other prediction tasks in structural biology, including protein-protein
interactions, disordered regions, and binding sites (17–20). However, prediction accuracies for all
these advanced methods drop sharply in the absence of a multitude of sequence homologs from
which to construct MSAs.
We propose OmegaFold, to our knowledge the first computational method to predict the
structure of a protein from its primary sequence alone with high accuracy, using a new combination
of a large pretrained language model for sequence modeling and a geometry-inspired transformer
model for structure prediction (Fig. 1A). Notably, OmegaFold requires only a single amino-acid
sequence for protein structure prediction, relies on neither MSAs nor known structures as templates,
and runs roughly ten times faster than MSA-based methods such as AlphaFold2 and RoseTTAFold,
with comparable or better accuracy. We demonstrate OmegaFold's ability to more accurately
predict the structures of orphan proteins and antibodies, for which evolutionary information is
scarce or noisy, respectively (Fig. 2).
The key idea behind Geoformer is to make the embeddings from our language model more
geometrically consistent: amino acid node and pairwise embeddings should generate consistent
coordinate and distance predictions when projected to 3D. While similar in principle to the
Evoformer module in AlphaFold2, which applies attention mechanisms for information integration,
ours focuses mainly on vector geometry as opposed to evolutionary variation. It consists of a deep
stack of 50 Geoformer layers, inspired by the fundamental theorem of vector calculus in
geometry (Fig. 3B). Each Geoformer layer encodes information in node representations (s_i) for
residue i and pairwise representations (p_ij) between residues i and j by enforcing their geometric
consistency. Intuitively, we can view the representation of residue i as a point in a high-dimensional vector
space, and each pairwise representation p_ij as a vector pointing from residue i to residue j, which
will be used for predicting three-dimensional coordinates and distances. Ideally, coordinates and
pairwise distances predicted from s_i and p_ij should satisfy the properties of Euclidean geometry, such
as the triangle inequality. However, these properties may not always hold when the representation
vectors and predictions are output from neural networks. Based on these geometric insights, we
designed a geometry-inspired nonlinear attention module that sequentially updates s_i and p_ij,
attempting to make them consistent. By stacking Geoformer layers upon the PLM, OmegaFold
captures the geometry of a protein structure with representations that are then projected onto 3D
space with an 8-layer structure module (Fig. 1A).
Inspired by AlphaFold2's training, we trained the entire OmegaFold model with several
structural objectives, including contact prediction, the Frame Aligned Point Error (FAPE) loss (1) (a
geometrically restricted version of the root-mean-square deviation (RMSD) of relative atomic
positions), and torsion angle prediction. The full model was jointly trained on ~110,000 single-
chain structures from the Protein Data Bank (PDB) (29–31) deposited before 2021 and all single
domains from the SCOP v1.75 database at a 40% sequence-identity cutoff (32–34). We
used later-released structures for validation and hyperparameter selection. We also excluded from
training any protein structures that appear in our test sets, as well as their homologs up to 40%
sequence identity.
Since MSAs are no longer required for OmegaFold to achieve high-resolution performance,
its overall runtime is much shorter than that of AlphaFold2, as well as of the latest highly optimized
ColabFold-AF2 (40) (Fig. 1C).
In summary, our study leverages a protein language model trained on unaligned sequences
to predict protein structures from single amino acid sequences alone. We give further evidence
that such evolutionary information may well be encoded in primary sequences, which can then be
used as features for structure prediction. As more sequencing data accumulate to feed MSA-based
methods, OmegaFold fills the interim gap and, importantly, predicts structures from sequences for
which MSAs are difficult to construct.
We expect that our conceptual advance and further algorithmic development along these
lines will continue to enable a wide spectrum of protein science applications, such as multi-state
conformational sampling, where AlphaFold2 has had some success in regions for which
predictions are challenging (41), variant effect prediction, protein-protein interactions and protein
docking.
Author contributions:
J.P., J.M., B.B., F.D., R.Wu developed the conceptual ideas. R.Wu, F.D., R.Wang and R.S.
implemented the main algorithms and finished model training. X.Z., R.Wu and R.Wang did data
preprocessing. C.S., Z.W. and R.Wu conducted baseline experiments. B.B., J.M., J.P., C.S.,
R.Wang, R.Wu interpreted the results. B.B., J.M., J.P., F.D., R.Wang, R.Wu, R.S., X.Z., S.L.,
C.S., Z.W., Q.X. wrote the manuscript.
Competing interests:
J.M., J.P., F.D., R.Wang, R.Wu, R.S., X.Z., S.L., C.S., Z.W. are from HeliXon Ltd.
Figs. S1 to S2
Tables S1 to S4
References
1. J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool,
R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A.
Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman,
E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D.
Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, D. Hassabis, Highly accurate
protein structure prediction with AlphaFold. Nature. 596, 583–589 (2021).
4. L. S. Johnson, S. R. Eddy, E. Portugaly, Hidden Markov model speed heuristic and iterative
HMM search procedure. BMC Bioinformatics. 11, 431 (2010).
10. M. J. Skwark, D. Raimondi, M. Michel, A. Elofsson, Improved contact predictions using the
recognition of protein like contact patterns. PLoS Comput. Biol. 10, e1003889 (2014).
11. S. Wang, S. Sun, Z. Li, R. Zhang, J. Xu, Accurate DE Novo prediction of protein contact map
by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
12. D. T. Jones, S. M. Kandathil, High precision in protein contact prediction using fully
convolutional neural networks and minimal sequence features. Bioinformatics. 34, 3308–
3315 (2018).
14. M. Steinegger, J. Söding, MMseqs2 enables sensitive protein sequence searching for the
analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
20. C. J. Wilson, W.-Y. Choy, M. Karttunen, AlphaFold2: A role for disordered protein/region
prediction? Int. J. Mol. Sci. 23, 4591 (2022).
21. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, in Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers) (Association for Computational Linguistics,
Minneapolis, Minnesota, 2019), pp. 4171–4186.
23. A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, D. Guo, M. Ott, C. L. Zitnick, J. Ma, R.
Fergus, Biological structure and function emerge from scaling unsupervised learning to 250
million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 118 (2021),
doi:10.1073/pnas.2016239118.
25. T. Bepler, B. Berger, Learning the protein language: Evolution, structure, and function. Cell
Syst. 12, 654-669.e3 (2021).
26. S. Sledzieski, R. Singh, L. Cowen, B. Berger, D-SCRIPT translates genome to phenome with
sequence-based, structure-aware, genome-scale predictions of protein-protein interactions.
Cell Syst. 12, 969-982.e6 (2021).
28. B. Hie, E. D. Zhong, B. Berger, B. Bryson, Learning the language of viral evolution and
escape. Science. 371, 284–288 (2021).
31. Crystallography: Protein data bank. Nat. New Biol. 233, 223–223 (1971).
34. A. Andreeva, E. Kulesha, J. Gough, A. G. Murzin, The SCOP database in 2020: expanded
classification of representative family and superfamily domains of known protein structures.
Nucleic Acids Res. 48, D376–D382 (2020).
35. W. Hua, Z. Dai, H. Liu, Q. Le, in Proceedings of the 39th International Conference on
Machine Learning, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S. Sabato, Eds.
(PMLR, 17–23 Jul 2022), vol. 162 of Proceedings of Machine Learning Research, pp. 9099–
9117.
[Figure 1 image. (A) Model architecture: a single primary protein sequence (no MSA) is fed to the pretrained language model (OmegaPLM), then to the Geoformer (50 layers) and structure module (8 layers), with recycling, yielding the predicted structure. (B) Example predictions versus native structures for CAMEO targets 7DKI:A (LDDT 0.93, TM-score 0.98), 7EBQ:A (LDDT 0.78, TM-score 0.95), 7ED6:A (LDDT 0.91, TM-score 0.98) and CASP targets T1005 (LDDT 0.95, TM-score 0.98), T1056 (LDDT 0.93, TM-score 0.95); bar plots report CAMEO LDDT (single-sequence input: OmegaFold 0.82, AlphaFold2 0.34, RoseTTAFold 0.14; MSA input: AlphaFold2 0.86, RoseTTAFold 0.75) and CASP TM-score comparisons. (C) Runtime versus protein size for OmegaFold and AlphaFold2. See caption below.]
Fig. 1 Overview of OmegaFold and Results. (A) Model architecture of OmegaFold. The primary
protein sequence is first fed into a pretrained protein language model (OmegaPLM) to obtain
residue-level node embeddings and residue-residue pairwise embeddings. A stack of Geoformer
layers then iteratively updates these embeddings to improve their geometric consistency. Lastly, a
structure module predicts the 3D protein structure from the final embeddings. The predicted
structure and the embeddings can be fed back as input for another cycle through a recycling
procedure to produce a more refined structure. (B) Evaluations on recent CAMEO and CASP targets.
Our predictions (blue) for 7DKI:A, 7EBQ:A, 7ED6:A from CAMEO and T1005, T1056 from
CASP are highly accurate when compared with the experimental structures (green). The plots on
the right show held-out test results on 146 CAMEO targets and 29 challenging CASP targets. OmegaFold
significantly outperforms AlphaFold2 and RoseTTAFold when only single sequences are provided
as input on both standard CAMEO Local Distance Difference Tests (LDDTs) and CASP TM-
scores; OmegaFold performs comparably to AlphaFold2 and RoseTTAFold on the CASP and
CAMEO test cases when the standard MSAs are used as input. (C) Runtime analysis. OmegaFold
is significantly faster than AlphaFold2 (ColabFold version) on single-chain proteins with typical
lengths of around 250, 500 and 1000 residues. ColabFold was used to further decrease both the
MSA search time (pink) and the model inference time (red).
[Figure 2 image. (A) Antibody loops: native vs. OmegaFold vs. AlphaFold2 predictions for 7KPJ (RMSD 0.38 Å vs. 2.46 Å), 7PHU (1.82 Å vs. 5.83 Å), 7SJS (0.62 Å vs. 4.19 Å) and 7QJI (2.11 Å vs. 9.33 Å), with a scatter plot of OmegaFold vs. AlphaFold2 CDR-H3 RMSD (OmegaFold 2.12 Å, P = 0.0017). (B) Orphan proteins: 7CG5 (TM-score 0.93 vs. 0.27), 7F7P (0.90 vs. 0.65), 7S5L (0.96 vs. 0.29) and 7WRK (0.85 vs. 0.30), with a scatter plot of OmegaFold (0.73) vs. AlphaFold2 (0.60) TM-scores (P = 0.0238). See caption below.]
Fig. 2 OmegaFold performs significantly better than AlphaFold2 in modeling antibody CDR-
H3 regions and predicting structures of orphan proteins. (A) Antibody CDR-H3 regions.
OmegaFold predictions are colored blue; AlphaFold2 predictions, red; and native experimental
structures, green. CDR-H3 regions are highlighted in the plots. On the CDR3 of nanobody 7QJI and
the CDR-H3 loops of antibodies 7KPJ, 7PHU and 7SJS, root-mean-square deviations (RMSDs) of
OmegaFold loop predictions are substantially lower than those of AlphaFold2 predictions. The scatter
plot depicts the comparison on 33 recently released nanobody and antibody proteins with high-
resolution experimental structures. Overall, OmegaFold predictions (RMSD = 2.12 Å) are
significantly better than AlphaFold2 predictions (RMSD = 2.98 Å), with a P-value of 0.0017. (B)
Orphan proteins. On orphan proteins 7CG5 (sensor domain of RsgI4), 7F7P (anti-CRISPR
AcrIIC4), 7S5L (cembrene A synthase) and 7WRK (hypothetical protein TTHA1873), TM-
scores (a common metric for assessing the topological similarity of protein structures) of
OmegaFold predictions are substantially higher than those of AlphaFold2 predictions. The scatter plot
shows comparisons on 19 recently released orphan proteins with no identified homologous
sequences. Overall, OmegaFold predictions (TM-score = 0.73) are better than AlphaFold2
predictions (TM-score = 0.60), with a P-value of 0.0238.
[Figure 3 image. (A) Residue, motif and subsequence reconstruction pretraining tasks for OmegaPLM. (B) Geometric smoothing of node and pair embeddings in the embedding space (smooth nodes with pairs; smooth pairs with pairs), alternating through Geoformer layers. (C) Contact prediction accuracy (Top L, Top L/2, Top L/5) as a function of the number of Geoformer layers (0–50). (D) Contact maps predicted by OmegaPLM, by 20 Geoformer layers, and by all Geoformer layers: Top L 0.393 / 0.578 / 0.704, Top L/2 0.612 / 0.806 / 0.985, Top L/5 0.815 / 0.926 / 1.000, inconsistency 0.638 / 0.160 / 0.104. See caption below.]
Fig. 3 Geoformer refines the contact map predicted by the protein language model (OmegaPLM)
through geometric smoothing. (A) The OmegaPLM model is pretrained with a per-residue mask loss,
a per-motif mask loss and a subsequence mask loss on unaligned protein sequences. (B) Geoformer
layers iteratively smooth node and pairwise embeddings and reduce the geometric inconsistency
among them. Initially, node and pairwise embeddings generated by OmegaPLM reside in a latent
space with geometric inconsistency (red). In each Geoformer layer, these embeddings are updated
iteratively to reduce the geometric inconsistency: each node embedding is updated with its related
pairwise embeddings, and each pairwise embedding is updated to enforce triangular consistency of
the pairwise embeddings. (C) Geoformer layers improve the geometry of contact predictions. Top L,
Top L/2 and Top L/5 accuracies denote contact prediction precision among the top predicted contact
pairs. Inconsistency is defined as the percentage of predicted distance triples {ij, jk, ik} that violate
the triangle inequality. Prediction accuracy improves while inconsistency drops as Geoformer layers
are stacked. (D) Visualization of contact maps: the contact maps predicted by OmegaPLM and after
the 20th and the last Geoformer layers for protein 7EAD:A from the CAMEO dataset. In the first 20
layers, Geoformer mainly focuses on resolving the triangular inconsistency, which decreases from
0.638 to 0.160. In the last 30 layers, Geoformer mainly refines the details of the contact prediction,
with the Top L accuracy increasing from 0.578 to 0.704.
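For concreteness, the inconsistency metric can be computed from a predicted distance matrix as in the following sketch (our own illustration; the exact counting convention used for Fig. 3 may differ):

```python
import torch

def triangle_inconsistency(dist: torch.Tensor) -> float:
    """Fraction of (i, k, j) triples with d_ij > d_ik + d_kj, given a symmetric
    (L, L) matrix of predicted pairwise distances (illustrative convention)."""
    d_ij = dist.unsqueeze(1)   # broadcasts to [i, k, j] -> d_ij
    d_ik = dist.unsqueeze(2)   # broadcasts to [i, k, j] -> d_ik
    d_kj = dist.unsqueeze(0)   # broadcasts to [i, k, j] -> d_kj
    violations = d_ij > d_ik + d_kj
    return violations.float().mean().item()
```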
sufficient to predict the protein 3D structures (23). Instead of constructing a general pre-trained protein
language model (PLM) suitable for various protein tasks, in this work we aim to design a PLM that can
greatly benefit protein 3D structure prediction. OmegaPLM strikes a good balance between efficiency
and performance, as shown in Figure S1 and Table S1, where OmegaPLM consumes fewer resources and
performs better.
Algorithm 1: Protein language model based on the Gated Attention Unit (GAU)
1  def OmegaPLM({n_i}, d_k = 256, d = 1,280, N_stack = 66, d_v = 2,560):
2    for l in [1, ..., N_stack] do
3      r_i = LayerNorm(n_i)
4      u_i, v_i, g_i = SiLU(Linear(r_i))
5      {q_i} = RoPE({w_q ⊙ u_i + b_q})
6      {k_i} = RoPE({w_k ⊙ u_i + b_k})
7      α_ij = softmax_j( (log n / √d_k) · q_i^T k_j + b_{i−j} )
8      o_i = g_i ⊙ Σ_j α_ij v_j
9      n_i += Linear(o_i)
10   end
11   return {n_i}
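To make the layer concrete, the following is a minimal PyTorch sketch of the gated-attention update in Algorithm 1. The rotary position embedding (RoPE) and the learned relative-position bias b_{i−j} are omitted for brevity, and the module is an illustration rather than the released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAUBlock(nn.Module):
    """Minimal sketch of one OmegaPLM layer (Algorithm 1); RoPE and the
    relative-position bias are omitted, dimensions are illustrative."""

    def __init__(self, d=1280, d_k=256, d_v=2560):
        super().__init__()
        self.d_k, self.d_v = d_k, d_v
        # pre-LayerNorm with the element-wise affine removed (see text)
        self.norm = nn.LayerNorm(d, elementwise_affine=False)
        # one shared projection producing base u, value v and gate g
        self.to_uvg = nn.Linear(d, d_k + 2 * d_v)
        # element-wise affine maps turning the base u into queries and keys
        self.wq = nn.Parameter(torch.ones(d_k))
        self.bq = nn.Parameter(torch.zeros(d_k))
        self.wk = nn.Parameter(torch.ones(d_k))
        self.bk = nn.Parameter(torch.zeros(d_k))
        self.out = nn.Linear(d_v, d)

    def forward(self, n):                          # n: (batch, length, d)
        length = n.shape[1]
        r = self.norm(n)
        u, v, g = F.silu(self.to_uvg(r)).split([self.d_k, self.d_v, self.d_v], dim=-1)
        q = self.wq * u + self.bq
        k = self.wk * u + self.bk
        # (log n / sqrt(d_k)) scaling of the query-key logits, as in line 7
        scale = math.log(length) / math.sqrt(self.d_k)
        logits = torch.einsum("bid,bjd->bij", q, k) * scale
        alpha = logits.softmax(dim=-1)
        o = g * torch.einsum("bij,bjd->bid", alpha, v)
        return n + self.out(o)                     # residual node update (line 9)
```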
Pre-LayerNorm As shown in Algorithm 1, we choose the pre-LayerNorm configuration, which places the
layer normalization at the start of each residual branch. As suggested by recent studies, pre-LayerNorm yields
[Figure S1 image: GPU memory versus average runtime (in seconds) for several protein language models, both axes on a log scale; see caption below.]
Figure S1: OmegaPLM strikes a great balance between efficiency and performance. Both the time and
memory axes are on a log scale. Each model in the plot has four points, with the sequence length growing
exponentially from 128 to 1,024 while the batch size simultaneously decreases from 64 to 8. The points of
all models in this plot run from bottom left to top right. ProtT5 cannot fit on our testing GPU (Nvidia A100
80GB) with the same data size during training. Values are means over the best 64 of 128 rounds in total.
more stable gradients, especially at initialization (27–30). In many pre-LayerNorm Transformers, including
our architecture, prevalent implementations of normalization layers contain element-wise affine
transformations with learnable parameters that are immediately followed by linear operations. This
configuration is mathematically redundant, except for minor differences caused by the choice of optimizer
during training. We therefore remove all element-wise affine transformations in the pre-LayerNorm.
Gated Attention Unit Instead of using multi-headed self-attention (MHSA), we adopt the Gated
Attention Unit (GAU) (line 8 in Algorithm 1), which has shown great promise as an alternative to MHSA,
with smaller memory consumption and a faster convergence rate (25). GAU applies the gate operation after
the attention aggregation and, in its original formulation, replaces the conventional softmax(·) function with
relu²(·) to aggregate the pairwise logits. In particular, we use an extra gating vector g_i ∈ R^{d_v}, where d_v is
the dimensionality of the value vector, which is later multiplied element-wise with the weighted summation
of the value vectors v_j (line 8).
Similar to conventional transformer models, all the value vectors v_i, base vectors u_i and gating vectors
g_i in GAU (25) are generated by a sequence of Linear-SiLU operations; the base vectors are then used to
produce queries and keys through element-wise affine operations (lines 4–6 in Algorithm 1).
The original GAU work (25) also suggests that relu²(·) performs better than softmax in terms of both
computation speed and convergence rate. This performance and speed gain can only be achieved when
Table S1: Supervised contact prediction performance shows that OmegaPLM clearly performs well. The
contact prediction head is made of ResNets with identical configurations. The training (6,292 proteins)
and test (699 proteins) datasets were selected as subsets of the PDB (each protein having < 25% sequence
similarity) using the following filtering conditions: (i) sequence length between 40 and 700; (ii) resolution
better than 2.5 Å; (iii) no domains made up of multiple protein chains; (iv) PDB file with valid Cβ
coordinates (Cα in GLY) for no less than 80% of the complete protein sequence. Our validation data are
kept the same as in (26), except for the 105 CASP11 test proteins. Proteins having sequence similarity
> 25% or a BLAST E-value < 0.1 with any test protein were excluded from the training data.
the lengths of training and test proteins are relatively similar. However, it is common for the length of
proteins to be predicted to fall outside the range of lengths seen in the training samples. To address this
problem, we choose the softmax operation, since it performs better than relu²(·) at sustaining the output
distribution of the aggregated value vectors, as the attention weights always sum to 1. Empirically, we find
that models trained with relu²(·) perform considerably worse than their softmax counterparts when the
input sequence length is out of the range of the training samples, which is consistent with another study
reporting that models trained on long sequences perform poorly on shorter sequences when using
relu²(·) (31). There are a few other works targeting the extrapolation ability of the attention mechanism.
(32) introduces a hand-crafted relative positional bias term added to the pre-softmax logits. (33) argues for
scaling the qk-product not only with the inverse square root of d_k, but also with log n, where n is the
number of tokens in one sequence. When viewed as a distribution and under certain assumptions, the
entropy of the softmax attention scores oscillates less with varying sequence length under the log n
scaling (33) than under the plain 1/√d_k scaling, which improves generalization with respect to sequence length.
Hyper-parameters of OmegaPLM The model configurations and the hyper-parameters are presented
in Table S2. We generate the output tokens (35) using tied embeddings and cosine normalization (28, 36).
The clipping thresholds in Table S2 indicate that relative distances outside the given range are rounded to
the thresholds.
1.2 Training
Training Objective Unlike ESM-1b (12) or other protein language models (38), we train OmegaPLM
with a number of different objectives. We adopt the objective functions used in BERT (1) and
SpanBERT (39) as our framework, and modify different parts of it to obtain the following variants of the
objective function:
1. BERT loss. For each sequence, 15% of tokens are selected as targets to be predicted; 80% of these
are replaced with a [mask] token, 10% are replaced with a random amino acid, and the
final 10% are left unchanged. This procedure is the same as the loss function used in the ESM-1b
work (12);
2. SpanBERT-like loss. We sample the span length from a Poisson distribution with λ = 7, clip
the sampled value to the range [5, 8], and then mask tokens consecutively according to the span length.
Unlike SpanBERT, we still use the output embeddings of the corresponding tokens to perform
the prediction, rather than the boundary tokens of the spans;
3. Sequential masking, where we mask either the first half or the second half of the sequence, akin to
prefix language modeling (4, 40).
Moreover, we assign different weights to these loss terms: the weights of the first two loss functions are
0.45 each, and the weight of the last one is 0.1.
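For illustration only (not our released code), the masking schemes above could be sampled as follows; the 80/10/10 split and the Poisson span sampling follow the description, while the helper names and interfaces are hypothetical.

```python
import torch

def bert_targets(length: int, mask_frac: float = 0.15):
    """BERT loss: pick 15% of positions; 80% -> [mask], 10% -> random amino
    acid, 10% kept unchanged (actions encoded as 0 / 1 / 2)."""
    n_targets = max(1, round(length * mask_frac))
    positions = torch.randperm(length)[:n_targets]
    roll = torch.rand(n_targets)
    action = torch.zeros(n_targets, dtype=torch.long)   # 0: replace with [mask]
    action[roll >= 0.8] = 1                              # 1: random amino acid
    action[roll >= 0.9] = 2                              # 2: keep original token
    return positions, action

def span_length(lmbda: float = 7.0, lo: int = 5, hi: int = 8) -> int:
    """SpanBERT-like loss: span length ~ Poisson(7), clipped to [5, 8]."""
    span = int(torch.poisson(torch.tensor(lmbda)).item())
    return min(max(span, lo), hi)

def sequential_mask(length: int, first_half: bool) -> torch.Tensor:
    """Sequential masking: mask either the first or the second half."""
    mask = torch.zeros(length, dtype=torch.bool)
    half = length // 2
    mask[:half] = first_half
    mask[half:] = not first_half
    return mask
```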
Focal Loss We observe that many amino acids can be accurately predicted from their short-range
sequence context. This creates an easy prediction task for the model to learn and causes the model to
focus overly on short-range relations. To address this problem, we adopt the focal loss (41) to down-weight
the easy targets and make the model focus more on capturing the long-range relationships among different
amino acids.
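A minimal sketch of such a focal masked-token loss is shown below; the focusing parameter γ = 2 is an assumed value, as the text does not state it, and the interface is illustrative.

```python
import torch
import torch.nn.functional as F

def focal_masked_lm_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0):
    """Focal variant of the masked-token cross-entropy: confidently predicted
    (easy) tokens are down-weighted by (1 - p_t)^gamma."""
    log_p = F.log_softmax(logits, dim=-1)                        # (n_masked, vocab)
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```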
Data We train OmegaPLM on the Uniref50 (42) dataset, version 2021_04. During training,
we randomly sample cluster-center protein sequences from Uniref50 (42) as input data, where each
batch contains 4,096 sequences and each sequence is padded or cropped to 512 residues. To fit one batch
into GPU memory, we apply gradient checkpointing to 23 of the 66 layers during training.
Since OmegaPLM is trained for 3D structure prediction, we use 99.99% of the data for training
and the remaining sequences for validation.
Parameter Initialization All the weights in the output projection layers and biases are initialized as
zeros following (20, 43). All the other weights are initialized using the normal distribution with zero mean
and standard deviation of 0.02 following (25).
Recipe OmegaPLM is implemented in PyTorch (44) and trained for 2,560 Nvidia A100 80GB GPU-days.
We use the token-dropout scheme of the ESM-1b model (12) during pre-training and the AdamW (45)
algorithm as our optimizer for OmegaPLM. The peak learning rate is set to 5e-4 and all other
parameters of AdamW stay at their PyTorch defaults. The learning-rate warm-up starts from 2.5e-6 and
follows a cosine scheme for 12.5k steps; the linear weight decay starts at step 100k and lasts 500k steps.
The learning rate then stays constant at 2e-5 for another 100k steps. In addition, gradients of all parameters
are clipped to a norm of 0.3 (46) to address spikes in the loss and gradient norms during
training (47–50). On the hardware side, we find that training is accelerated by incorporating
PowerSGD (51) gradient compression with rank 32 to reduce the communication load across different
GPUs. Empirically, we find that PowerSGD improves training speed by about 30%. Though this compression
introduces noise into the gradients, we find such noise inconsequential compared to the gain in
convergence speed. The default precision format on Nvidia A100 GPUs, TensorFloat-32, is used for matrix operations.
We do not change this setting, as IEEE float16 causes overflow issues during training and IEEE float32
slows computation considerably. In validation, we clip long proteins to their first 1,024 residues.
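For reference, PyTorch ships PowerSGD compression as a DDP communication hook; the sketch below shows how a rank-32 configuration could be registered (it assumes an already-initialized process group, and the exact configuration used for training may differ).

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

def wrap_with_powersgd(model: torch.nn.Module, rank: int = 32) -> DDP:
    """Wrap a model in DDP and register the PowerSGD compression hook
    (requires torch.distributed.init_process_group to have been called)."""
    ddp_model = DDP(model.cuda())
    state = powerSGD.PowerSGDState(
        process_group=None,               # default process group
        matrix_approximation_rank=rank,   # rank-32 compression as in the text
    )
    ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
    return ddp_model
```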
As shown in Algorithm 3, we update the node embeddings {n_i} based on two factors: 1) the attention
between node i and every other node j, and 2) the edge embeddings {w_ij}, which capture the interactions
between i and j more directly. For each node, Geoformer first aggregates the information from all the
other nodes to generate a basic node representation (the NodeAttention and NodeTransition functions in
lines 3-4 of Algorithm 3). Next, we produce an intermediate edge representation solely from the node
representations inferred in the previous step, using the Node2Edge function. Similar to NodeAttention,
we also update the edge representations {w_ij} based on all the other edges using a transformer-based
module, EdgeAttention, which is also the function we rely on to achieve geometric consistency in the high-
dimensional space. We repeat this process 50 times to generate node and edge representations
for each residue and residue pair. In the last 8 layers of Geoformer (lines 11-15), we aim to maintain
geometric consistency in Euclidean space. In particular, we first translate the inferred node and
edge representations of a protein into an intermediate 3D structure using the StructureModule function
implemented in AlphaFold2 (lines 24-31 of Algorithm 20 in AlphaFold2), where x_i denotes the coordinates
of the atoms in amino acid i. We then translate the intermediate 3D structure back into the high-dimensional
space using the 3Dprojection function, whose outputs have the same dimensionality as w_ij. In this way,
the updated edge representation contains information indirectly encoded from the 3D space. Another
EdgeAttention function is applied to achieve geometric consistency for the newly updated w_ij. Eventually,
both the node representations n_i and the edge representations w_ij from the last layer are used to predict
the 3D coordinates and are connected to the loss functions. Note that we omit the LayerNorm operations
on set variables such as {w_ij} to simplify the notation.
q_i, k_i, v_i = Linear(n_i^(ℓ−1))
b_ij = Linear(w_ij)
α_ij = softmax_j( (1/√c) · q_i^T k_j + b_ij )                                  (1)
o_i = sigmoid(Linear(n_i^(ℓ−1))) ⊙ Σ_j α_ij v_j
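A single-head PyTorch sketch of the attention update in Eq. (1) is given below; multi-head splitting and layer normalization are omitted, and all dimensions are illustrative simplifications rather than the actual configuration.

```python
import math
import torch
import torch.nn as nn

class NodeAttention(nn.Module):
    """Single-head sketch of Eq. (1): node attention biased by the edge
    embeddings and gated by a sigmoid (multi-head splitting omitted)."""

    def __init__(self, d_node=256, d_edge=128, c=32):
        super().__init__()
        self.c = c
        self.to_qkv = nn.Linear(d_node, 3 * c)
        self.edge_bias = nn.Linear(d_edge, 1)
        self.gate = nn.Linear(d_node, c)
        self.out = nn.Linear(c, d_node)

    def forward(self, n, w):                          # n: (L, d_node), w: (L, L, d_edge)
        q, k, v = self.to_qkv(n).chunk(3, dim=-1)     # each (L, c)
        logits = q @ k.t() / math.sqrt(self.c) + self.edge_bias(w).squeeze(-1)
        alpha = logits.softmax(dim=-1)                # softmax over j
        o = torch.sigmoid(self.gate(n)) * (alpha @ v) # gated aggregation
        return self.out(o)
```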
a_i, b_i = Linear(n_i)
w_ij = Linear(a_i ⊗ b_j)                                                        (3)
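A sketch of the outer-product construction in Eq. (3) (the Node2Edge step) is shown below; the projection sizes are illustrative.

```python
import torch
import torch.nn as nn

class Node2Edge(nn.Module):
    """Sketch of Eq. (3): pairwise features from an outer product of two linear
    projections of the node embeddings (projection sizes are illustrative)."""

    def __init__(self, d_node=256, d_proj=32, d_edge=128):
        super().__init__()
        self.proj = nn.Linear(d_node, 2 * d_proj)
        self.out = nn.Linear(d_proj * d_proj, d_edge)

    def forward(self, n):                              # n: (L, d_node)
        a, b = self.proj(n).chunk(2, dim=-1)           # a_i, b_i: (L, d_proj)
        outer = torch.einsum("ic,jd->ijcd", a, b)      # (L, L, d_proj, d_proj)
        return self.out(outer.flatten(-2))             # w_ij: (L, L, d_edge)
```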
An amino acid has a maximum of 14 atoms, and each atom is associated with a 3D coordinate. The function
StructureModule (lines 24-31 of Algorithm 20 in AlphaFold2) predicts 3D coordinates by optimizing the
translation and rotation matrices. Based on this intermediate 3D structure, the function 3Dprojection
produces a new type of interaction based on three types of information extracted from the predicted
intermediate protein structure {x_i}: 1) the amino acid types of i and j; 2) the relative positions of each
atom in amino acids i and j in different frames; and 3) the relative angles between the vectors of each atom
in amino acids i and j in different frames. The 3D-based interaction vector is projected by a multilayer
perceptron (MLP) to match the dimension of the edge embedding. The function 3Dprojection is defined in Algorithm 4.
Algorithm 4: 3Dprojection
1  def 3Dprojection(A_aa(i) ∈ R^L, {x_i} ∈ R^{L×14×3}, n = 3):
2    for i, j ∈ [1, L] do
       // Embed all atom pair distances:
3      d_ij = ||x_i − x_j||
       // Embed all atom frames:
4      f_aa(i)^(n) = Frame_n(x_i)
       // Embed distances in both rough and fine bins:
5      a_ij = Linear(OneHot(d_ij, bins = [3.25 Å, 20.75 Å, 16]))
6      b_ij = Linear(OneHot(d_ij, bins = [2.3125 Å, 21.6875 Å, 64]))
       // Embed frame-position directed distance in Euclidean space:
7      c_ij^(n) = OneHot(R(x_i, f_aa(j)^(n)), bins = [−16 Å, 16 Å, 64], space = Euclidean)
       // Embed frame-position angle distance in spherical space:
8      e_ij^(n) = OneHot(R(x_i, f_aa(j)^(n)), bins_ϕ = [0, 2π, 12], bins_ψ = [0, 2π, 12], space = Spherical)
9      g_ij = Linear(a_ij + b_ij)
10     h_ij = Linear(concat_n({c_ij^(n)}) + concat_n({e_ij^(n)}))
       // Embed pair amino acid types:
11     z_ij = A_aa(i) + A_aa(j)
12     w_ij = Linear(z_ij ⊗ (g_ij + h_ij))
13   end
14   return {w_ij}
Here d_ij is the distance between atoms i and j from two amino acids. a_ij and b_ij are embeddings
based on one-hot vectors of the Euclidean distance between atoms i and j; the difference between a_ij and
b_ij lies in the way their bins are constructed. The function Frame_n returns the frame f_aa(j)^(n), which
represents the local coordinate system defined by the Cα, Cβ and N atoms of amino acid aa(j), and the
function R is a projection function that returns the 3D coordinates x_i in the n-th coordinate system
f_aa(j)^(n), defined based on the amino acid associated with atom j. For each amino acid, we use the
eight-frame coordinate systems defined by AlphaFold2 (Section 1.8 of the AlphaFold2 supplementary
information). aa(i) is the type of the amino acid that atom i belongs to. z_ij represents the one-hot
embedding related to the types of the amino acids that atoms i and j are associated with. We use an
embedding matrix A ∈ R^{20×d} to represent each of the 20 types of amino
[Figure S2 image: (a) LDDT and (b) TM-score as a function of the number of recycles (0–16); see caption below.]
Figure S2: Performance of structure prediction with recycling. OmegaFold reaches reasonable performance
without recycling, and achieves its highest performance after around 10 recycles.
acids, where d denotes the dimension of the continuous embedding. Note that this embedding is not
position-specific; we refer to it as a type embedding because it corresponds to the type of the
amino acid. c_ij and e_ij are designed to model the relative positions of two atoms in different coordinate
systems, in both Euclidean space and spherical space. In particular, as it requires two angles ϕ
and ψ to describe a location in spherical space, we discretize these two angles into two sets of bins,
bins_ϕ and bins_ψ. The amino-acid embedding z_ij and the physical-location embeddings g_ij and h_ij are
later integrated together using the outer-product operation.
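As an illustration of the distance discretization used for a_ij and b_ij, the following sketch one-hot encodes distances with a [lower bound, upper bound, number of bins] convention; the exact boundary handling in the implementation may differ.

```python
import torch
import torch.nn.functional as F

def onehot_distance(d: torch.Tensor, lo: float = 3.25, hi: float = 20.75, n_bins: int = 16):
    """Discretize distances (in Angstroms) into n_bins one-hot bins spanning
    [lo, hi]; values below lo or above hi fall into the first or last bin."""
    edges = torch.linspace(lo, hi, n_bins - 1)   # n_bins - 1 boundaries -> n_bins buckets
    idx = torch.bucketize(d, edges)
    return F.one_hot(idx, num_classes=n_bins).float()
```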
2.5 Recycling
Similar to AlphaFold2 and RoseTTAFold, we employ a recycling procedure to iteratively refine the
quality of the structure prediction. We find that prediction accuracy improves with the number of
recycles and saturates after about 10 recycles (Fig. S2).
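Schematically, recycling re-runs the network with its own previous outputs as extra inputs; the sketch below only illustrates the control flow, and the `recycle_inputs` keyword is hypothetical.

```python
def predict_with_recycling(model, sequence, num_recycles: int = 10):
    """Run the model num_recycles + 1 times, feeding the previous cycle's
    embeddings and coordinates back in (hypothetical `recycle_inputs` keyword)."""
    prev = None
    for _ in range(num_recycles + 1):
        prev = model(sequence, recycle_inputs=prev)
    return prev
```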
2.6 Datasets
We download the protein structure data from the PDB website (https://www.rcsb.org/) and split each PDB
file into multiple single chains. A protein is removed if 1) it contains RNA fragments, or 2) more than
90% of its residues are unknown amino acids (marked as 'X'), or 3) one of its chains cannot be
parsed by the PDBParser of Biopython (52). Next, we cluster all the sequences using MMseqs2 (53) with
a sequence identity threshold of 100% and keep only the cluster centers. For the selected proteins, we
perform 40% clustering using MMseqs2 and calculate a static sampling rate for each protein chain; the
rate is inversely proportional to the size of the cluster to which the sample belongs.
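A small sketch of this inverse-cluster-size sampling rate, assuming a mapping from each chain to its 40%-identity cluster identifier:

```python
from collections import Counter

def inverse_cluster_rates(chain_to_cluster: dict) -> dict:
    """Sampling rate per chain, inversely proportional to the size of the
    cluster it belongs to (rates can be renormalized before sampling)."""
    cluster_sizes = Counter(chain_to_cluster.values())
    return {chain: 1.0 / cluster_sizes[cluster]
            for chain, cluster in chain_to_cluster.items()}
```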
We use two validation datasets. The first CASP dataset includes challenging free-modeling (FM) targets
from recent CASP13 and CASP14: T0950, T0953s1, T0953s2, T0955, T0957s1, T0957s2, T0958, T0960,
T0963, T0968s1, T0968s2, T1005, T1008, T1026, T1027, T1029, T1030, T1031, T1032, T1033, T1038,
T1043, T1046s1, T1046s2, T1049, T1056, T1064, T1082, T1099.
A second dataset includes recently released targets with a wide range of prediction difficulty from CAMEO,
from date1 to date2: 7RPY:A, 7PB4:I, 7FJS:L, 7U2R:A, 7AAL:A, 7KO9:A, 7RKC:A, 7EQH:A, 7U4H:A,
7ETS:B, 7E4S:A, 7F3A:A, 7LXK:A, 7EAD:A, 7DKI:A, 7ESH:A, 7VNO:A, 7Z79:B, 7RW4:A, 7W5M:A,
7WNW:B, 7OPB:D, 7EQE:A, 7N0E:A, 7T4Z:A, 7ED6:A, 7NUV:A, 7TV9:C, 7ZCL:B, 7VWT:A, 7PC6:A,
7NQD:B, 7TXP:A, 7PC4:A, 7QRY:B, 7FEV:A, 7FIW:B, 7RPR:A, 7OA7:A, 7EBQ:A, 7YWG:B, 7UGH:A,
7F0A:A, 7U5Y:A, 7NDE:A, 7QIL:A, 7X4E:A, 7OD9:C, 7TBU:A, 7W26:A, 7X4O:B, 7NMI:B, 7WRK:A,
7QSS:A, 7LI0:A, 7RGV:A, 7VSP:C, 7X8J:A, 7QDV:A, 7E52:A, 7RCW:A, 7TNI:C, 7PC3:A, 7N29:B,
7F2Y:A, 7ZGF:A, 7T03:A, 7MYV:B, 7BLL:A, 7MQ4:A, 7X9E:A, 7F6J:C, 7EJG:C, 7V4S:A, 7QAP:A,
7ACY:B, 7MLA:B, 7QAO:A, 7WWR:A, 7QSU:A, 7PZJ:A, 7V1K:A, 7SGN:C, 7Z5P:A, 7N3T:C, 7EGT:B,
7O4O:A, 7CTX:B, 7VNX:A, 7YXG:A, 7QS5:A, 7X8V:A, 7MKU:A, 7RPS:A, 7MS2:A, 7QBP:A,
7QS2:A, 7EHG:E, 7PRQ:B, 7S2R:B, 7R74:B, 7W5U:A, 7O0B:A, 7TA5:A, 7WWX:A, 7Q51:A, 7SXB:A,
7WCJ:A, 7LXS:A, 7OSW:A, 7WRP:A, 7PNO:D, 7WJ9:A, 7RCZ:A, 7U5F:D, 7WME:A, 7RXE:A,
7B0K:A, 7ERN:C, 7R63:C, 7SJL:A, 7BI4:A, 7W1F:B, 7ED1:A, 7RAW:A, 7SO5:H, 7VOH:A, 7Q05:E,
7QBZ:A, 7PSG:C, 7P0H:A, 7MHW:A, 7P3I:B, 7ULH:A, 7R09:A, 7F0O:B, 7EQB:A, 7EFS:D, 7TZE:C,
7W74:A, 7PXY:A, 7PW1:A, 7E5J:A, 7V8E:B, 7ERP:B, 7R5Y:E.
max_S  Σ_{r_i ∈ S} 1 / d(r_c, r_i)^2 − λ · |C|                                   (4)
s.t.  |S| = N_crop,  ∀ c_i ∈ C, |c_i| ≥ 32
where the function d(·, ·) measures the Euclidean distance between the Cβ atoms of a pair of amino acids
(for glycine we use its Cα atom). S is the set of residues we select and N_crop is the total number of amino
acids we want to select, chosen based on the GPU memory size. C is the set of continuous sub-sequences in
the selected 3D structure fragment, |C| is the number of sub-sequences, and λ is a pre-defined
hyper-parameter, set to 0.004 in our experiments. For each sub-sequence c_i ∈ C, the residues are
contiguous in the sequence and located on the same chain. The central idea of this optimization problem is
that we want to select N_crop amino acids around r_c in 3D space without too many sub-sequences, which
is enforced by the second term penalizing the number of sub-sequences. At the same time, we also require
the length of each sub-sequence c_i to be at least 32, to avoid overly short sub-sequences.
Equation 4 can be solved efficiently by dynamic programming. For a protein to be
cropped, we first concatenate all the chains of that protein into one sequence with n chains and a total
length of L. We then define an auxiliary variable F_{i,j} (0 ≤ i ≤ L, 0 ≤ j ≤ N_crop), which denotes the
maximum score achievable among the first i residues when j residues are selected and residue i is required
to be one of them, with all sub-sequences having length at least 32. We use another auxiliary variable
G_{i,j} (0 ≤ i ≤ L, 0 ≤ j ≤ N_crop) to denote the maximum value among the first i residues when j
residues are selected, without requiring residue i to be among them; again, all fragments must have length
at least 32. In other words, G_{i,j} is the running maximum of F over the first index. The only difference
between F and G is that F requires residue i to be in the selected residue set while G does not. F and G
can be calculated using the following recursion:
F_{i,j} = max {
    F_{i−1,j−1} + 1 / d(r_c, r_i)^2,                                  i > 0, j > 0;
    G_{i−32,j−32} + Σ_{i−31 ≤ t ≤ i} 1 / d(r_c, r_t)^2 − λ,           i ≥ 32, j ≥ 32;          (5)
    −∞,                                                               otherwise.
}

G_{i,j} = max {
    F_{i−1,j},                                                        i > 0, j ≥ 0;
    G_{i−1,j},                                                        i > 0, j ≥ 0;            (6)
    0,                                                                i = 0, j = 0;
    −∞,                                                               otherwise.
}
The maximum value of Equation 4 is obtained at G_{L,N_crop}. By recording the generation process of
this optimal solution, we can collect the amino acids on the optimal path to obtain the sets S and C. For a
selected set S, we first generate the sequence embedding of each sub-sequence separately using OmegaPLM
and then concatenate them as the embedding of the 3D structure fragment. To use OmegaPLM, we need to
specify new residue indices for the positional embedding, as the cropped data are not contiguous. We set a
gap of 64 in the residue index between different sub-sequences, so that pair embeddings across different
sub-sequences fall into separate relative-position bins.
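The recursion can be implemented directly; the sketch below is a simplified single-chain version in which chain boundaries and path backtracking are omitted, and the index conventions are slightly streamlined relative to Eqs. (5) and (6).

```python
import math
import numpy as np

def crop_score(scores, n_crop: int, min_len: int = 32, lam: float = 0.004):
    """Simplified dynamic program for Eq. (4): scores[i] plays the role of
    1 / d(r_c, r_i)^2.  Returns the best achievable objective value."""
    L = len(scores)
    if n_crop > L:
        raise ValueError("cannot crop more residues than the protein has")
    neg = -math.inf
    prefix = np.concatenate([[0.0], np.cumsum(scores)])
    F = np.full((L + 1, n_crop + 1), neg)   # residue i is the last selected one
    G = np.full((L + 1, n_crop + 1), neg)   # running maximum of F over i
    G[0, 0] = 0.0
    for i in range(1, L + 1):
        for j in range(1, n_crop + 1):
            best = neg
            # extend the current sub-sequence by residue i
            if F[i - 1, j - 1] > neg:
                best = F[i - 1, j - 1] + scores[i - 1]
            # start a new sub-sequence of exactly min_len residues ending at i
            if i >= min_len and j >= min_len and G[i - min_len, j - min_len] > neg:
                window = prefix[i] - prefix[i - min_len]
                best = max(best, G[i - min_len, j - min_len] + window - lam)
            F[i, j] = best
        G[i, :] = np.maximum(G[i - 1, :], F[i, :])
    return G[L, n_crop]
```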
2.8 Training
Loss function We train the OmegaFold system using the contact loss, FAPE loss, torsion angle loss,
and structure violation loss, as in AlphaFold2. We discard the experimentally-resolved loss, which does not
influence the folding performance. We also discard the masked MSA prediction loss, since we no longer use
MSAs as input.
L = L_FAPE + 0.5 L_aux + 0.3 L_dist + 0.01 L_conf                      (first stage)
L = L_FAPE + 0.5 L_aux + 0.3 L_dist + 0.01 L_conf + 1.0 L_viol         (second and last stages)       (7)
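A minimal sketch of the stage-dependent weighting in Eq. (7); the dictionary keys are hypothetical names for the precomputed loss terms.

```python
def total_loss(losses, with_violation=False):
    """Combine the loss terms with the Eq. (7) weights; with_violation is
    True in the second and last training stages."""
    total = (losses["fape"] + 0.5 * losses["aux"]
             + 0.3 * losses["dist"] + 0.01 * losses["conf"])
    if with_violation:
        total = total + 1.0 * losses["violation"]
    return total
```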
Parameter Initialization The weights of all output linear layers are initialized as zeros, and the biases in
the linear layers before the sigmoid function are initialized as ones while their weights are zeros. All the
other linear layers are initialized with Glorot uniform initialization.
Recipe The protein language model and the 3D structure prediction model are trained separately, as the
protein language model is trained on much more unlabeled data. In this stage, we freeze the protein
language model and only optimize the 3D structure prediction model, using the AdamW (55) algorithm
with default parameters. The batch size stays at 128 throughout training. The learning rate linearly
increases from 0 to 5e-4 during the first 1,000 steps, stays at 5e-4 for the next 19,000 steps, then changes
to 1e-4 for another 20,000 steps, and is fixed at 5e-5 for the last 15,000 steps (Table S3). We also clip the
gradient norm with a threshold of 0.1, using the function torch.nn.utils.clip_grad_norm_ in PyTorch (56),
which clips the gradients after collecting the gradients of all samples with an all-reduce operation.
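For illustration, one training step with this global gradient-norm clipping might look as follows (a sketch, not the actual training loop):

```python
import torch

def training_step(model, optimizer, loss, max_norm: float = 0.1):
    """Backpropagate, clip the global gradient norm at 0.1, and update."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```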
Notably, after retraining with single sequences as input and without the language model, AlphaFold2 is not
able to predict structures with comparable performance.
2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsuper-
vised multitask learners,” 2019.
5. K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision
learners,” 2021. [Online]. Available: https://arxiv.org/abs/2111.06377
6. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of
visual representations,” in Proceedings of the 37th International Conference on Machine Learning,
ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119. PMLR,
13–18 Jul 2020, pp. 1597–1607. [Online]. Available: https://proceedings.mlr.press/v119/chen20j.html
7. T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong
semi-supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
“Bootstrap your own latent: A new approach to self-supervised learning,” 2020. [Online]. Available:
https://arxiv.org/abs/2006.07733
9. X. Shao, H. Yang, X. Zhuang, J. Liao, P. Yang, J. Cheng, X. Lu, H. Chen, and X. Fan, “scDeepSort: a
pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a
weighted graph neural network,” Nucleic Acids Research, vol. 49, no. 21, pp. e122–e122, 09 2021.
[Online]. Available: https://doi.org/10.1093/nar/gkab775
10. F. Yang, W. Wang, F. Wang, Y. Fang, D. Tang, J. Huang, H. Lu, and J. Yao, “scbert is a large-scale
pretrained deep language model for cell type annotation of single-cell rna-seq data,” bioRxiv, 2022.
[Online]. Available: https://www.biorxiv.org/content/early/2022/06/06/2021.12.05.471261
11. E. C. Alley, G. Khimulya, S. Biswas, M. AlQuraishi, and G. M. Church, “Unified rational protein
engineering with sequence-based deep representation learning,” Nature Methods, vol. 16, no. 12, pp.
1315–1322, Dec 2019. [Online]. Available: https://doi.org/10.1038/s41592-019-0598-1
13. T. Bepler and B. Berger, “Learning the protein language: Evolution, structure, and
function,” Cell Systems, vol. 12, no. 6, pp. 654–669.e3, Jun 2021. [Online]. Available:
https://doi.org/10.1016/j.cels.2021.05.017
14. R. Rao, J. Meier, T. Sercu, S. Ovchinnikov, and A. Rives, “Transformer protein language models are
unsupervised structure learners,” in International Conference on Learning Representations, 2021.
[Online]. Available: https://openreview.net/forum?id=fylclEqgvgd
15. J. Vig, A. Madani, L. R. Varshney, C. Xiong, R. Socher, and N. Rajani, “BERTology meets
biology: Interpreting attention in protein language models,” in International Conference on Learning
Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YWtLZvLmud7
17. J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives, “Language models enable zero-shot
prediction of the effects of mutations on protein function,” in Advances in Neural Information
Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021. [Online].
Available: https://openreview.net/forum?id=uXc42E9ZPFs
18. B. Hie, E. D. Zhong, B. Berger, and B. Bryson, “Learning the language of viral
evolution and escape,” Science, vol. 371, no. 6526, pp. 284–288, 2021. [Online]. Available:
https://www.science.org/doi/abs/10.1126/science.abd7331
19. B. L. Hie, K. K. Yang, and P. S. Kim, “Evolutionary velocity with protein language models predicts
evolutionary dynamics of diverse proteins,” Cell Systems, vol. 13, no. 4, pp. 274–285.e6, Apr 2022.
[Online]. Available: https://doi.org/10.1016/j.cels.2022.01.003
21. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search
tool,” J Mol Biol, vol. 215, no. 3, pp. 403–410, Oct 1990.
22. M. Steinegger, M. Meier, M. Mirdita, H. Vöhringer, S. J. Haunsberger, and J. Söding, “HH-suite3 for
fast remote homology detection and deep protein annotation,” BMC Bioinformatics, vol. 20, no. 1, p.
473, Sep 2019. [Online]. Available: https://doi.org/10.1186/s12859-019-3019-7
25. W. Hua, Z. Dai, H. Liu, and Q. V. Le, “Transformer quality in linear time,” 2022. [Online]. Available:
https://arxiv.org/abs/2202.10447
26. J. Xu, “Distance-based protein folding powered by deep learning,” Proceedings of the National Academy of Sciences, vol. 116, no. 34, pp. 16856–16865, 2019. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.1821309116
27. A. Baevski and M. Auli, “Adaptive input representations for neural language modeling,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ByxZX20qFQ
28. T. Q. Nguyen and J. Salazar, “Transformers without tears: Improving the normalization of
self-attention,” in Proceedings of the 16th International Conference on Spoken Language Translation.
Hong Kong: Association for Computational Linguistics, Nov. 2-3 2019. [Online]. Available:
https://aclanthology.org/2019.iwslt-1.17
29. R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. Daumé III and A. Singh, Eds., vol. 119. PMLR, 13–18 Jul 2020, pp. 10524–10533. [Online]. Available: https://proceedings.mlr.press/v119/xiong20b.html
30. L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Understanding the difficulty of training transformers,”
in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 5747–5763. [Online].
Available: https://aclanthology.org/2020.emnlp-main.463
31. J. Su. (2022, Apr) Softmax pairs with attention to achieve interpolation and extrapolation. [Online].
Available: https://kexue.fm/archives/9019
32. O. Press, N. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input
length extrapolation,” in International Conference on Learning Representations, 2022. [Online].
Available: https://openreview.net/forum?id=R8sQPpGCv0
33. J. Su. (2021, Dec) On the scaling of attention mechanisms from the perspective of entropy
conservation. [Online]. Available: https://kexue.fm/archives/8823
34. J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” 2021. [Online]. Available: https://arxiv.org/abs/2104.09864
35. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv, vol. abs/1907.11692, 2019.
36. C. Luo, J. Zhan, X. Xue, L. Wang, R. Ren, and Q. Yang, “Cosine normalization: Using cosine
similarity instead of dot product in neural networks,” in Artificial Neural Networks and Machine
Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis,
Eds. Cham: Springer International Publishing, 2018, pp. 382–391.
37. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016. [Online]. Available:
https://arxiv.org/abs/1607.06450
38. A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi, W. Yu, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, D. Bhowmik, and B. Rost, “ProtTrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
39. M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “SpanBERT: Improving
pre-training by representing and predicting spans,” Transactions of the Association for Computational
Linguistics, vol. 8, pp. 64–77, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.5
40. P. J. Liu*, M. Saleh*, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer, “Generating
wikipedia by summarizing long sequences,” in International Conference on Learning Representations,
2018. [Online]. Available: https://openreview.net/forum?id=Hyg0vbWC-
41. T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in 2017
IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999–3007.
42. B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, and the UniProt Consortium, “UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches,” Bioinformatics, vol. 31, no. 6, pp. 926–932, Nov 2014. [Online]. Available: https://doi.org/10.1093/bioinformatics/btu739
43. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” 2017. [Online]. Available: https://arxiv.org/abs/1706.02677
45. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on
Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
46. R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural
networks,” in Proceedings of the 30th International Conference on Machine Learning, ser.
Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28,
no. 3. Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1310–1318. [Online]. Available:
https://proceedings.mlr.press/v28/pascanu13.html
47. S. Reed, K. Żołna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Giménez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,” 2022. [Online]. Available: https://arxiv.org/abs/2205.06175
48. S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “OPT: Open pre-trained transformer language models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.01068
53. M. Steinegger and J. Söding, “MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets,” Nature Biotechnology, vol. 35, no. 11, pp. 1026–1028, 2017.
54. B. Li, Y. Liu, and X. Wang, “Gradient harmonized single-stage detector,” in Proceedings of the AAAI
conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 8577–8584.
55. S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
56. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.