🧬📝 Awesome Biomolecule-Language Cross Modeling

This repository accompanies the survey Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey. It collects related models, datasets and benchmarks, and other resource links.

🔥 2025.12.01: We have updated our paper and repository with new models, datasets, and additional features developed since February 2024. Please check the latest version of our paper and the updated repository for more details.

🌟 If you have a paper or resource you'd like to add, feel free to submit a pull request, open an issue, or email the author at [email protected].

Table of Contents

Models
Datasets & Benchmarks
Related Resources
- Related Surveys & Evaluations
- Related Workshop
- Related Repositories
Acknowledgements
Citations

Models

BioText Bioinformatics 2019 BioBERT: a pre-trained biomedical language representation model for biomedical text mining |
BioText EMNLP-IJCNLP 2019 SciBERT: A Pretrained Language Model for Scientific Text |
BioText BioNLP@ACL 2019 (BlueBERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets |
BioText EMNLP 2020 Bio-Megatron: Larger Biomedical Domain Language Model |
BioText BioNLP@CHIL 2020 ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission |
BioText BioNLP@ACL 2021 BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA |
BioText HEALTH 2021 (PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing |
BioText Arxiv 2021 SciFive: a text-to-text transformer model for biomedical literature |
BioText NeurIPS 2022 (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining |
BioText ACL 2022 LinkBERT: Pretraining Language Models with Document Links |
BioText BioNLP@ACL 2022 BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model |
BioText Bioinformatics 2022 BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining |
BioText Arxiv 2022 GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records |
BioText Nature 2023 Large language models encode clinical knowledge |
BioText ACL 2023 (ScholarBERT) The Diminishing Returns of Masked Language Models to Science |
BioText Arxiv 2023 PMC-LLaMA: Further Finetuning LLaMA on Medical Papers |
BioText Arxiv 2023 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine |
BioText Arxiv 2023 (GatorTronGPT) A study of generative large language model for medical research and healthcare |
BioText Arxiv 2023 Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding |
BioText Arxiv 2023 MEDITRON-70B: Scaling Medical Pretraining for Large Language Models |
BioText Arxiv 2023 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-inspired Materials |
BioText Arxiv 2023 ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation |
BioText Arxiv 2023 MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data |
BioText Arxiv 2024 SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning |
BioText Arxiv 2024 BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text |
BioText ACL 2024 BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains |
Text + Molecule EMNLP 2021 Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries |
Text + Molecule EMNLP 2022 (MolT5) Translation between Molecules and Natural Language |
Text + Molecule Nature Communications 2022 (KV-PLM) A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals |
Text + Molecule Arxiv 2022 (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language |
Text + Molecule ICML 2023 (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling |
Text + Molecule ICML 2023 (CLAMP) Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language |
Text + Molecule NeurIPS 2023 GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning |
Text + Molecule AI4D3@NeurIPS 2023 MoleculeGPT: Instruction Following Large Language Models for Molecular Property Prediction |
Text + Molecule Datasets&Benchmarks@NeurIPS 2023 (ChemLLMBench) What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks |
Text + Molecule ACL 2023 MolXPT: Wrapping Molecules with Text for Generative Pre-training |
Text + Molecule EMNLP 2023 (TextReact) Predictive Chemistry Augmented with Text Retrieval |
Text + Molecule EMNLP 2023 MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter |
Text + Molecule EMNLP 2023 ReLM: Leveraging Language Models for Enhanced Chemical Reaction Prediction |
Text + Molecule Nature Machine Intelligence 2023 (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing |
Text + Molecule IEEE TAI 2023 (AMAN) Adversarial Modality Alignment Network for Cross-Modal Molecule Retrieval |
Text + Molecule Bioinformatics 2024 MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations |
Text + Molecule Arxiv 2023 (MolReGPT) Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective |
Text + Molecule Arxiv 2023 (CaR) Can Large Language Models Empower Molecular Property Prediction? |
Text + Molecule Arxiv 2023 MolFM: A Multimodal Molecular Foundation Model |
Text + Molecule Arxiv 2023 (ChatMol) Interactive Molecular Discovery with Natural Language |
Text + Molecule Arxiv 2023 InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery |
Text + Molecule Arxiv 2023 ChemCrow: Augmenting large-language models with chemistry tools |
Text + Molecule Arxiv 2023 GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction |
Text + Molecule Arxiv 2023 nach0: Multimodal Natural and Chemical Languages Foundation Model |
Text + Molecule Arxiv 2023 DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs |
Text + Molecule AAAI 2024 (Ada/Aug-T5) From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery |
Text + Molecule AAAI 2024 MolTailor: Tailoring Chemical Molecular Representation to Specific Tasks via Text Prompts |
Text + Molecule AAAI 2024 (TGM-DLM) Text-Guided Molecule Generation with Diffusion Language Model |
Text + Molecule Computers in Biology and Medicine 2024 GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text |
Text + Molecule Chemical Science 2024 PolyNC: a natural and chemical language model for the prediction of unified polymer properties |
Text + Molecule Arxiv 2024 MolTC: Towards Molecular Relational Modeling In Language Models |
Text + Molecule Arxiv 2024 T-Rex: Text-assisted Retrosynthesis Prediction |
Text + Molecule COLM 2024 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset |
Text + Molecule Arxiv 2024 (Drug-to-indication) Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications |
Text + Molecule Arxiv 2024 ChemDFM: Dialogue Foundation Model for Chemistry |
Text + Molecule Briefings in Bioinformatics 2024 DrugAssist: A Large Language Model for Molecule Optimization |
Text + Molecule Arxiv 2024 ChemLLM: A Chemical Large Language Model |
Text + Molecule OpenReview (TEDMol) Text-guided Diffusion Model for 3D Molecule Generation |
Text + Molecule Bioinformatics 2025 (3DToMolo) Sculpting Molecules in 3D: A Flexible Substructure Aware Framework for Text-Oriented Molecular Optimization |
Text + Molecule IEEE TKDE 2024 (ICMA) Large Language Models are In-Context Molecule Learners |
Text + Molecule Arxiv 2024 (LLMaMol) Benchmarking Large Language Models for Molecule Prediction Tasks |
Text + Molecule Arxiv 2024 3M-Diffusion: Latent Multi-Modal Diffusion for Text-Guided Generation of Molecular Graphs |
Text + Molecule Biology 2024 (TSMMG) Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model |
Text + Molecule Eng Appl Artif Intell (SLM4CRP) A Self-feedback Knowledge Elicitation Approach for Chemical Reaction Predictions |
Text + Molecule ICLR 2025 Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation |
Text + Molecule ACL 2024 ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining |
Text + Molecule L+M 2024 ALMol: Aligned Language-Molecule Translation LLMs through Offline Preference Contrastive Optimisation |
Text + Molecule Arxiv 2024 LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space |
Text + Molecule Arxiv 2024 DrugLLM: Open Large Language Model for Few-shot Molecule Generation |
Text + Molecule Arxiv 2024 (HI-Mol) Data-Efficient Molecular Generation with Hierarchical Textual Inversion |
Text + Molecule KDD 2024 (MV-Mol) Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge |
Text + Molecule NLPCC 2024 DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs |
Text + Molecule Arxiv 2024 HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment |
Text + Molecule EMNLP 2024 PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes |
Text + Molecule Arxiv 2024 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization |
Text + Molecule Arxiv 2024 MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction |
Text + Molecule Arxiv 2024 MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension |
Text + Molecule Arxiv 2024 (AMOLE) Vision Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models |
Text + Molecule Arxiv 2024 (Chemma-RC) Text-Augmented Multimodal LLMs for Chemical Reaction Condition Recommendation |
Text + Molecule ICLR 2024 (SMILES-probing) Chemical Language Models Have Problems with Chemistry: A Case Study on Molecule Captioning Task |
Text + Molecule Arxiv 2024 UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation |
Text + Molecule ECAI 2025 (UTGDiff) Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model |
Text + Molecule ACL 2024 Mol2Lang-VLM: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion |
Text + Molecule ACL 2024 Lang2Mol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging SELFIES Representation |
Text + Molecule ACL 2024 (Self-Augmentation) Enhancing Cross Text-Molecule Learning by Self-Augmentation |
Text + Molecule ACL 2024 MTSwitch: A Web-based System for Translation between Molecules and Texts |
Text + Molecule NeurIPS 2024 SmileyLlama: Modifying Large Language Models for Directed Chemical Space Exploration |
Text + Molecule NeurIPS 2024 (MoleculeSTM) Geometry-text Multi-modal Foundation Model for Reactivity-oriented Molecule Editing |
Text + Molecule ACL 2025 (MSR) Structural Reasoning Improves Molecular Understanding of LLM |
Text + Molecule Arxiv 2024 (TransDLM) Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model |
Text + Molecule Arxiv 2024 Can LLMs Generate Diverse Molecules? Towards Alignment with Structural Diversity |
Text + Molecule JCIM 2025 (ChemLML) Chemical Language Model Linker: blending text and molecules with modular adapters |
Text + Molecule NeurIPS 2024 (Chemlactica) Small Molecule Optimization with Large Language Models |
Text + Molecule NeurIPS 2024 Question Rephrasing for Quantifying Uncertainty in Large Language Models: Applications in Molecular Chemistry Tasks |
Text + Molecule NeurIPS 2024 (LLaMo) Large Language Model-based Molecular Graph Assistant |
Text + Molecule NeurIPS 2024 (MolPuzzle) Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation |
Text + Molecule Arxiv 2024 (M³LLM) Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs |
Text + Molecule Arxiv 2024 MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts |
Text + Molecule EMNLP 2024 (AMORE) Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures |
Text + Molecule BIBM 2024 GeomCLIP: Contrastive Geometry-Text Pre-training for Molecules |
Text + Molecule BIBM 2024 (CMTMR) Towards Cross-Modal Text-Molecule Retrieval with Better Modality Alignment |
Text + Molecule BIBM 2024 (ORMA) Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval |
Text + Molecule Arxiv 2024 PEIT: Property Enhanced Instruction Tuning for Multi-task Molecular Generation with LLMs |
Text + Molecule Arxiv 2024 (HME) Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding |
Text + Molecule ICLR 2025 (Llamole) Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning |
Text + Molecule ICLR 2025 RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning |
Text + Molecule Arxiv 2025 OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery |
Text + Molecule NeurIPS 2025 Omni-Mol: Multitask Molecular Model for Any-to-any Modalities |
Text + Molecule Arxiv 2025 Mol-LLM: Multimodal Generalist Molecular LLM with Improved Graph Utilization |
Text + Molecule NMI 2025 (SLM4Mol) A Quantitative Analysis of Knowledge-Learning Preferences in Large Language Models in Molecular Science |
Text + Molecule Arxiv 2025 CLASS: Enhancing Cross-Modal Text-Molecule Retrieval Performance and Training Efficiency |
Text + Molecule NeurIPS 2025 Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model |
Text + Molecule Bioinformatics 2025 ChatMol: A Versatile Molecule Designer Based on the Numerically Enhanced Large Language Model |
Text + Molecule EMNLP 2025 MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model |
Text + Molecule Arxiv 2025 GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention |
Text + Molecule JBHI 2025 XMolCap: Advancing Molecular Captioning through Multimodal Fusion and Explainable Graph Neural Networks |
Text + Molecule JCIM 2025 (ChatChemTS) Large language models open new way of AI-assisted molecule design for chemists |
Text + Molecule JCIM 2025 (LLM-MPP) Effective and Explainable Molecular Property Prediction by Chain-of-Thought Enabled Large Language Models and Multi-Modal Molecular Information Fusion |
Text + Molecule AAAI 2025 Graph2Token: Make LLMs Understand Molecule Graphs |
Text + Molecule AAAI 2025 ExDDI: Explaining Drug-Drug Interaction Predictions with Natural Language |
Text + Molecule AAAI 2025 ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area |
Text + Molecule IJCAI 2025 ChemDual: Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning |
Text + Molecule ACL 2025 GeLLM³O: Generalizing Large Language Models for Multi-property Molecule Optimization |
Text + Molecule ACL 2025 Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation |
Text + Molecule EMNLP 2025 (GeLLMO-C) Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization |
Text + Molecule AAAI 2025 ReactGPT: Understanding of Chemical Reactions via In-Context Tuning |
Text + Molecule AAAI 2025 nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder |
Text + Molecule Arxiv 2025 mCLM: A Function-Infused and Synthesis-Friendly Modular Chemical Language Model |
Text + Molecule Arxiv 2025 ChemMLLM: Chemical Multimodal Large Language Model |
Text + Molecule Arxiv 2025 RTMol: Rethinking Molecule-text Alignment in a Round-trip View |
Text + Molecule Arxiv 2025 (CLEANMOL) Improving Chemical Understanding of LLMs via SMILES Parsing |
Text + Molecule NeurIPS 2025 ModuLM: Enabling Modular and Multimodal Molecular Relational Learning with Large Language Models |
Text + Molecule Arxiv 2025 (ToDi) TextOmics-Guided Diffusion for Hit-like Molecular Generation |
Text + Molecule Arxiv 2025 ChemDFM-R: An Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge |
Text + Molecule PAKDD 2025 Dual Learning Between Molecules and Natural Language |
Text + Molecule MM 2025 CROP: Integrating Topological and Spatial Structures via Cross-View Prefixes for Molecular LLMs |
Text + Molecule Arxiv 2025 Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery |
Text + Molecule Arxiv 2025 AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models |
Text + Molecule Bioinformatics 2025 MolPrompt: improving multi-modal molecular pre-training with knowledge prompts |
Text + Molecule EMNLP 2025 (CAMT5) Training Text-to-Molecule Models with Context-Aware Tokenization |
Text + Molecule Arxiv 2025 Enhancing Molecular Property Prediction with Knowledge from Large Language Models |
Text + Molecule Knosys 2025 (MolFinePrompt) Fine-grained multimodal molecular pretraining via prompt learning |
Text + Molecule Arxiv 2025 (MPPReasoner) Reasoning-Enhanced Large Language Models for Molecular Property Prediction |
Text + Molecule Arxiv 2025 (MECo) Coder as Editor: Code-Driven Interpretable Molecule Optimization |
Text + Molecule Arxiv 2025 Chem-R: Learning to Reason as a Chemist |
Text + Molecule Arxiv 2025 KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge |
Text + Molecule IEEE Trans Comput Soc Syst 2025 (Mol-LLM) Incorporating Molecular Knowledge in Large Language Models via Multimodal Modeling |
Text + Molecule MM 2025 DeepMolTex: Deep Alignment of Molecular Graphs with Large Language Models via Mixture of Modality Experts |
Text + Molecule Neurocomputing 2025 Mol-L2: Transferring text knowledge with frozen language models for molecular representation learning |
Text + Molecule ACL 2025 (MolRAG) Unlocking the Power of LLMs for Molecular Property Prediction |
Text + Molecule ACL 2025 Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning |
Text + Molecule EMNLP 2025 (MolBridge) Bridging the Gap Between Molecule and Textual Descriptions via Substructure-aware Alignment |
Text + Molecule EMNLP 2025 GAMIC: Graph-Aligned Molecular In-context Learning for Molecule Analysis via LLMs |
Text + Molecule EMNLP 2025 How to Make Large Language Models Generate 100% Valid Molecules? |
Text + Molecule EMNLP 2025 Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction |
Text + Protein ICLR 2022 OntoProtein: Protein Pretraining With Gene Ontology Embedding |
Text + Protein RECOMB 2022 ProTranslator: Zero-Shot Protein Function Prediction Using Textual Description |
Text + Protein ICML 2023 ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts |
Text + Protein Nature Biotechnology 2023 (ProGen) Large language models generate functional protein sequences across diverse families |
Text + Protein Arxiv 2023 (ProteinDT) A Text-guided Protein Design Framework |
Text + Protein TechRxiv 2023 ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein 3D Structures |
Text + Protein ICML 2025 2023 (ProteinChat v2) Multi-Modal Large Language Model Enables Protein Function Prediction |
Text + Protein AAAI 2024 Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers |
Text + Protein Arxiv 2024 ProtAgents: Protein discovery via large language model multi-agent collaborations combining physics and machine learning |
Text + Protein TAI 2024 ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing |
Text + Protein ACL 2024 ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training |
Text + Protein ACL 2024 ProtT3: Protein-to-Text Generation for Text-based Protein Understanding |
Text + Protein bioRxiv 2024 ProteinCLIP: enhancing protein language models with natural language |
Text + Protein Arxiv 2024 ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction |
Text + Protein KDD 2025 (PAAG) Annotation-guided Protein Design with Multi-Level Domain Alignment |
Text + Protein bioRxiv 2024 (Pinal) Toward De Novo Protein Design from Natural Language |
Text + Protein Arxiv 2024 TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering |
Text + Protein KDD 2025 (SEPIT) Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs |
Text + Protein NeurIPS 2024 (BioM3) Natural Language Prompts Guide the Design of Novel Functional Protein Sequences |
Text + Protein NeurIPS 2024 (LLM4ProteinEvolution) Language Models for Text-guided Protein Evolution |
Text + Protein NeurIPS 2024 MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins |
Text + Protein NeurIPS 2024 MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering |
Text + Protein Arxiv 2024 EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations |
Text + Protein Bioinformatics 2024 FAPM: functional annotation of proteins using multimodal models beyond structural modeling |
Text + Protein bioRxiv 2024 ProCyon: A multimodal foundation model for protein phenotypes |
Text + Protein bioRxiv 2025 (Evolla) Decoding the Molecular Language of Proteins with Evolla |
Text + Protein ICLR 2025 ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding |
Text + Protein ICLR 2025 (MP4) A generalized protein design ML model enables generation of functional de novo proteins |
Text + Protein SIGIR 2025 ProtChatGPT: Towards Understanding Proteins with Hybrid Representation and Large Language Models |
Text + Protein AAAI 2025 ProtCLIP: Function-informed protein multi-modal learning |
Text + Protein AAAI 2025 (CtrlProt) Controllable Protein Sequence Generation with LLM Preference Optimization |
Text + Protein NAACL 2025 Protein2Text: Resampling Mechanism to Translate Protein Sequences into Human-Interpretable Text |
Text + Protein EMNLP 2025 (RAPM) Rethinking Text-based Protein Understanding: Retrieval or LLM? |
Text + Protein NeurIPS 2025 Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment |
Text + Protein NeurIPS 2025 (ProDVa) Protein Design with Dynamic Protein Vocabulary |
Text + Protein ICML 2025 ProteinAligner: A Tri-Modal Contrastive Learning Framework for Protein Representation Learning |
Text + Protein JCIM 2025 ProtTeX: Structure-in-context reasoning and editing of proteins with large language models |
Text + Protein Bioinformatics 2025 Prot2Chat: protein large language model with early fusion of text, sequence, and structure |
Text + Protein Nature Biotechnology 2025 ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning |
Text + Protein Nature Communications 2025 Ab-initio amino acid sequence design from protein text description with ProtDAT |
Text + Protein ACL 2025 Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization |
Text + Protein ACL 2025 (LLaPA) Large Language and Protein Assistant for Protein-Protein Interactions Prediction |
Text + Protein Arxiv 2025 Protein as a Second Language for LLMs |
Text + Protein Arxiv 2025 CMADiff: Cross-Modal Aligned Diffusion for Controllable Protein Generation |
Text + Protein Arxiv 2025 Guiding Generative Models for Protein Design: Prompting, Steering and Aligning |
Text + Protein ICLR Submission 2026 Caduceus: MoE-enhanced Foundation Models Unifying Biological and Natural Language |
Text + BioMulti Arxiv 2022 Galactica: A Large Language Model for Science |
Text + BioMulti EMNLP 2023 BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations |
Text + BioMulti Arxiv 2023 DARWIN Series: Domain Specific Large Language Models for Natural Science |
Text + BioMulti Arxiv 2023 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine |
Text + BioMulti Arxiv 2023 (StructChem) Structured Chemistry Reasoning with Large Language Models |
Text + BioMulti Nature Communications 2023 (BioTranslator) Multilingual translation for zero-shot biomedical classification using BioTranslator |
Text + BioMulti ICLR 2024 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models |
Text + BioMulti ICLR 2024 (ChatDrug) ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback |
Text + BioMulti ICLR 2024 BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs |
Text + BioMulti AAAI 2024 (KEDD) Towards Unified AI Drug Discovery with Multiple Knowledge Modalities |
Text + BioMulti AAAI 2024 (Otter Knowledge) Knowledge Enhanced Representation Learning for Drug Discovery |
Text + BioMulti Arxiv 2024 ChatCell: Facilitating Single-Cell Analysis with Natural Language |
Text + BioMulti ICML 2024 LangCell: Language-Cell Pre-training for Cell Identity Understanding |
Text + BioMulti ACL 2024 BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning |
Text + BioMulti Arxiv 2024 MolBind: Multimodal Alignment of Language, Molecules, and Proteins |
Text + BioMulti Arxiv 2024 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer |
Text + BioMulti ICML 2024 Tag-LLM: Repurposing General-Purpose LLMs for Specialized Domains |
Text + BioMulti Arxiv 2024 An Evaluation of Large Language Models in Bioinformatics Research |
Text + BioMulti bioRxiv 2024 SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences |
Text + BioMulti NeurIPS 2024 SciDFM: A Large Language Model with Mixture-of-Experts for Science |
Text + BioMulti Sci China Inf Sci 2024 ChemDFM-X: Towards Large Multimodal Model for Chemistry |
Text + BioMulti Arxiv 2025 (NatureLM) Nature Language Model: Deciphering the Language of Nature for Scientific Discovery |
Text + BioMulti Cell Rep Phys Sci 2024 (ChemDFM) Developing ChemDFM as a large language foundation model for chemistry |
Text + BioMulti Information Fusion 2025 (KFPPIMI) Improving protein–protein interaction modulator predictions via knowledge-fused language models |
Text + BioMulti Arxiv 2025 STELLA: Towards Protein Function Prediction with Multimodal LLMs Integrating Sequence-Structure Representations |
Text + BioMulti Arxiv 2025 (CAFT) Improving Large Language Models with Concept-Aware Fine-Tuning |
Text + BioMulti Arxiv 2025 InstructPro: Natural Language Guided Ligand-Binding Protein Design |
Text + BioMulti NMI 2025 (InstructBioMol) Advancing biomolecular understanding and design following human instructions |
Text + BioMulti bioRxiv 2025 DrugLM: A Unified Framework to Enhance Drug-Target Interaction Predictions by Incorporating Textual Embeddings via Language Models |
Text + BioMulti Arxiv 2025 Intern-S1: A Scientific Multimodal Foundation Model |
Text + BioMulti Arxiv 2025 Chem3DLLM: 3D Multimodal Large Language Models for Chemistry |
Text + BioMulti Arxiv 2025 SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines |
Text + BioMulti Arxiv 2025 MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design |

Datasets & Benchmarks

Pre-training- Text PubMed
Pre-training- Text bioRxiv
Pre-training- Text MedRxiv
Pre-training- Text S2ORC
Pre-training- Text MIMIC
Pre-training- Text UF Health
Pre-training- Text Elsevier Corpus
Pre-training- Text Europe PMC
Pre-training- Text LibreText
Pre-training- Text NLM literature archive
Pre-training- Text GAP-Replay
Pre-training- Molecule ZINC
Pre-training- Protein UniProt
Pre-training- Molecule, Bioassay ChEMBL
Pre-training- Molecule, Bioassay GIMLET
Pre-training- Text, Molecule mCLM
Pre-training- Text, Molecule PEIT-GEN
Pre-training- Text, Molecule, IUPAC, etc PubChem
Pre-training- Text, Protein InterPT
Pre-training- Text, Protein, etc STRING
Pre-training- Text, Molecule MolTextNet
Pre-training- Text, Protein ProtAnnotation
Pre-training- Text, Protein ProteinKG25
Pre-training- Text, Protein ProtAnno
Pre-training- Text, Protein ProtDescribe
Pre-training- Text, Protein ProteinAligner pretrain data
Pre-training- Text, Protein ProTAD
Pre-training- Text, Protein Molinst-SwissProtCLAP
Fine-tuning- Text BLURB
Fine-tuning- Text PubMedQA
Fine-tuning- Text SciQ
Fine-tuning- Text BioASQ
Fine-tuning- Text MedC-I
Fine-tuning- Molecule MoleculeNet
Fine-tuning- Molecule MoleculeACE
Fine-tuning- Molecule TDC
Fine-tuning- Molecule USPTO
Fine-tuning- Molecule Graph2graph
Fine-tuning- Molecule PubChem Molecule Optimization
Fine-tuning- Protein PEER
Fine-tuning- Protein FLIP
Fine-tuning- Protein TAPE
Fine-tuning- Text, Molecule PubChemSTM
Fine-tuning- Text, Molecule PseudoMD-1M
Fine-tuning- Text, Molecule ChEBI-20
Fine-tuning- Text, Molecule ChEBI-20-MM
Fine-tuning- Text, Molecule ChEBL-dia
Fine-tuning- Text, Molecule PCdes
Fine-tuning- Text, Molecule MoMu
Fine-tuning- Text, Molecule PubChemQA
Fine-tuning- Text, Molecule 3D-MolT
Fine-tuning- Text, Molecule MoleculeQA
Fine-tuning- Text, Molecule MolTextQA
Fine-tuning- Text, Molecule MolOpt-Instructions
Fine-tuning- Text, Molecule SMolInstruct
Fine-tuning- Text, Molecule PubChem324k
Fine-tuning- Text, Molecule KnowMol-100K
Fine-tuning- Text, Molecule MolQA
Fine-tuning- Text, Molecule MolTC
Fine-tuning- Text, Molecule Mol-LLaMA-Instruct
Fine-tuning- Text, Molecule PEIT-LLM
Fine-tuning- Text, Molecule SmileyLlama
Fine-tuning- Text, Molecule MuMOInstruct
Fine-tuning- Text, Molecule HiPubChem
Fine-tuning- Text, Molecule ExDDI
Fine-tuning- Text, Molecule MMP
Fine-tuning- Text, Molecule SLM4CRP_with_RTs
Fine-tuning- Text, Molecule, etc SciAssess
Fine-tuning- Text, Molecule, etc DrugBank
Fine-tuning- Text, Molecule, etc DARWIN
Fine-tuning- Text, Protein SwissProt
Fine-tuning- Text, Protein UniProtQA
Fine-tuning- Text, Protein InstructProtein
Fine-tuning- Text, Protein Open Protein Instructions
Fine-tuning- Text, Protein PDB-QA
Fine-tuning- Text, Protein ProteinQA
Fine-tuning- Text, Protein Protein2Text-QA
Fine-tuning- Text, Protein CAMEO
Fine-tuning- Text, Protein Swiss-Prot Curated Triplets
Fine-tuning- Text, Protein ProCyon-Instruct
Fine-tuning- Text, Molecule, Protein Mol-Instructions
Fine-tuning- Text, Molecule, Protein Biology-Instructions
Benchmark- Text SciEval
Benchmark- Text BioInfo-Bench
Benchmark- Text BioMedEval
Benchmark- Text, Molecule ChemLLMBench
Benchmark- Text, Molecule AI4Chem
Benchmark- Text, Molecule GPTChem
Benchmark- Text, Molecule, etc StructChem
Benchmark- Text, Molecule S²-Bench
Benchmark- Text, Molecule MotifHallu
Benchmark- Text, Molecule MolCap-Arena
Benchmark- Text, Molecule ChemCoTBench
Benchmark- Text, Molecule MolLangBench
Benchmark- Text, Molecule MolErr2Fix
Benchmark- Text, Molecule MolPuzzle
Benchmark- Text, Protein ProteinLMBench
Benchmark- Text, Protein Prot-Inst-OOD

Related Resources

Related Surveys & Evaluations

Multimodal Pre-training Models of Molecular Representation for Drug Discovery NSF 2511
A Comprehensive Survey of Multimodal LLMs for Scientific Discovery VLM4RWD@NeurIPS 2511
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers Arxiv 2508
A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization Arxiv 2505
From Generalist to Specialist: A Survey of Large Language Models for Chemistry Arxiv 2412
A Review of Large Language Models and Autonomous Agents in Chemistry Arxiv 2407
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery Arxiv 2406
Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule Arxiv 2403
Bioinformatics and Biomedical Informatics with ChatGPT: Year One Review Arxiv 2403
From Words to Molecules: A Survey of Large Language Models in Chemistry Arxiv 2402
Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science Arxiv 2402
Progress and Opportunities of Foundation Models in Bioinformatics Arxiv 2402
Scientific Large Language Models: A Survey on Biological & Chemical Domains ACM Computing Surveys
The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4 Arxiv 2311
Transformers and Large Language Models for Chemistry and Drug Discovery Arxiv 2310
Language models in molecular discovery Arxiv 2309
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks NeurIPS 2309
Do Large Language Models Understand Chemistry? A Conversation with ChatGPT JCIM 2303
A Systematic Survey of Chemical Pre-trained Models IJCAI 2023

Related Workshop

Language + Molecules @ ACL 2024 Workshop

Related Repositories

Acknowledgements

This repository is maintained and updated by QizhiPei and Lijun Wu. If you have questions, don't hesitate to open an issue, or contact QizhiPei via [email protected] or Lijun Wu via [email protected]. We are happy to hear from you!

Citations

@article{pei2024leveraging,
  title={Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey},
  author={Pei, Qizhi and Wu, Lijun and Gao, Kaiyuan and Zhu, Jinhua and Wang, Yue and Wang, Zun and Qin, Tao and Yan, Rui},
  journal={arXiv preprint arXiv:2403.01528},
  year={2024}
}
