A Survey on Language, Multimodal, and Scientific GPT Models: Examining User-Friendly and Open-Sourced Large GPT Models
Continuously updating
The original paper is available on arXiv.
The advent of GPT models has brought about a significant transformation in the field of NLP. These models, such as GPT-4, demonstrate exceptional capabilities across a wide range of NLP tasks. However, despite their impressive capabilities, large GPT models have inherent limitations that restrict their widespread adoption, usability, and fine-tuning. The need for user-friendly, relatively small, and open-sourced alternatives arises from the desire to overcome these limitations while retaining high performance. In this survey paper, we examine open-sourced alternatives to large GPT models, focusing on user-friendly and relatively small models (around 10B parameters) that facilitate easier deployment and accessibility.
- Investigate the architecture, design principles, and trade-offs of user-friendly and relatively small alternative GPT models, focusing on their ability to overcome the challenges posed by large GPT models.
- Present the data collection process and analyze pre-training data in terms of source, quality, quantity, and diversity, as well as fine-tuning data, including instruction data, alignment data, and domain-specific data for domain-specific models.
- Survey the techniques for efficient deployment and fine-tuning of these GPT models.
- Introduce ongoing open-source projects and initiatives for user-friendly GPT model reproduction and deployment.
- Provide a thorough analysis of benchmark evaluations and human evaluations of these relatively small GPT models, and offer practical recommendations for real-world usage.
- Explore the extension of GPT models to multimodal settings, focusing on models that integrate NLP with computer vision, and place special emphasis on user-friendly scientific GPT models, particularly in the biomedical domain.
The overview of the content is shown in Figure 1.

Related papers/links for open LLMs (list is continuously updated)
Language Domain
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 2020. [paper] [code & models] [Huggingface models]
- mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. NAACL 2021. [paper] [code & models] [Huggingface models]
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. [code & models] [Huggingface models]
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arxiv 2022. [paper] [code] [original models] [Huggingface models]
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. [code & models] [Huggingface models]
- OPT: Open Pre-trained Transformer Language Models. arxiv 2022. [paper] [code] [Huggingface models]
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arxiv 2022. [paper] [Huggingface models]
- Crosslingual Generalization through Multitask Finetuning. arxiv 2022. [paper] [Huggingface models]
- GLM: General Language Model Pretraining with Autoregressive Blank Infilling. ACL 2022. [paper] [code & models] [Huggingface models]
- GLM-130B: An Open Bilingual Pre-trained Model. ICLR 2023. [paper] [code & models]
- ChatGLM-6B [code & models] [Huggingface models]
- ChatGLM2-6B [code & models] [Huggingface models]
- LLaMA: Open and Efficient Foundation Language Models. arxiv 2023. [paper] [code & models]
- OpenLLaMA: An Open Reproduction of LLaMA. [code & models]
- Stanford Alpaca: An Instruction-following LLaMA Model. [code & models]
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. [blog] [code & models]
- StableLM: Stability AI Language Models. [code & models]
- Baize. [code & models]
- Koala: A Dialogue Model for Academic Research. [blog] [code & models]
- WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions. [code & models]
- Large-scale, Informative, and Diverse Multi-round Dialogue Data, and Models. [code & models]
- YuLan-Chat: An Open-Source Bilingual Chatbot. [code & models]
- Pythia: Interpreting Transformers Across Time and Scale. arxiv 2023. [paper] [code & models]
- Dolly. [code & models]
- OpenChatKit. [code & models]
- BELLE: Be Everyone's Large Language model Engine. [code & models]
- RWKV: Reinventing RNNs for the Transformer Era. arxiv 2023. [paper] [code & models] [Huggingface models]
- ChatRWKV. [code & models]
- MOSS. [code & models]
- RedPajama-INCITE. [blog] [Huggingface models]
- Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. [blog] [code] [Huggingface models]
- Introducing Falcon LLM. [blog] [Huggingface models]
- InternLM. [code & models]
- Baichuan-7B. [code & models]
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arxiv 2023. [paper] [code & models]
- Introducing Qwen-7B: Open foundation and human-aligned models. [code & models]
- XVERSE-13B. [code & models]
Multimodal Domain
- Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS 2022. [paper]
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arxiv 2023. [paper] [code]
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arxiv 2023. [paper] [website, code & models]
- Visual Instruction Tuning. arxiv 2023. [paper] [website, code & models]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arxiv 2023. [paper] [code & models]
- Transfer Visual Prompt Generator across LLMs. arxiv 2023. [paper] [website, code & models]
- Otter: A Multi-Modal Model with In-Context Instruction Tuning. arxiv 2023. [paper] [code & models]
- MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arxiv 2023. [paper] [code & models]
Scientific Domain
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Bioinformatics 2022. [paper] [code & models]
- Galactica: A Large Language Model for Science. arxiv 2022. [paper] [models]
- BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks. arxiv 2023. [paper] [code & models]
- MolXPT: Wrapping Molecules with Text for Generative Pre-training. ACL 2023. [paper] [code & models]
- Translation between Molecules and Natural Language. EMNLP 2022. [paper] [code & models]
Table 1. Statistical overview of open large language models in recent years, categorized by base model.
| Model | #Param | Backbone | Release Date | Training Data Size |
|---|---|---|---|---|
| T5 (enc-dec) [github] | 60M, 220M, 770M, 3B, 11B | Base Model | 2019-10 | 1T tokens |
| mT5 (enc-dec) [github] | 300M, 580M, 1.2B, 3.7B, 13B | Base Model | 2020-10 | 1T tokens |
| GPT-Neo [github] | 125M, 350M, 1.3B, 2.7B | Base Model | 2021-03 | 825GB |
| GPT-NeoX [github] | 20B | Base Model | 2022-02 | 825GB |
| GPT-J [github] | 6B | Base Model | 2021-06 | 825GB |
| OPT [github] | 125M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B | Base Model | 2022-05 | 180B tokens |
| BLOOM | 560M, 1.1B, 1.7B, 3B, 7.1B, 176B | Base Model | 2022-07 | 366B tokens |
| BLOOMZ | 560M, 1.1B, 1.7B, 3B, 7.1B, 176B | BLOOM | 2022-11 | - |
| GLM [github] | 110M, 335M, 410M, 515M, 2B, 10B, 130B | Base Model | 2021-03 | English Wikipedia |
| GLM-130B [github] | 130B | Base Model | 2022-08 | - |
| ChatGLM [github] | 6B | GLM | 2023-03 | - |
| ChatGLM2 [github] | 6B | GLM | 2023-06 | - |
| LLaMA [github] | 7B, 13B, 33B, 65B | Base Model | 2023-02 | 1.4T tokens |
| OpenLLaMA [github] | 3B, 7B | Reproduction of LLaMA | 2023-05 | - |
| Alpaca [github] | 7B | LLaMA | 2023-03 | 52K |
| Vicuna [github] | 7B, 13B | LLaMA | 2023-03 | 70K |
| StableVicuna [github] | 13B | Vicuna | 2023-04 | - |
| BAIZE [github] | 7B, 13B, 30B | LLaMA | 2023-04 | 54K/57K/47K |
| Koala [github] | 13B | LLaMA | 2023-04 | - |
| WizardLM [github] | 7B, 13B, 30B | LLaMA | 2023-06 | 250k/70k |
| UltraLM [github] | 13B | LLaMA | 2023-06 | - |
| Pythia [github] | 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B | Base Model | 2023-01 | 299.9B tokens/207B tokens |
| Dolly-v2 [github] | 12B | Pythia | 2023-04 | ~15k |
| OpenChatKit [github] | 7B | Pythia | 2023-03 | - |
| BELLE-7B [github] | 7B | Pythia | 2023-03 | 1.5M |
| StableLM-Alpha [github] | 3B, 7B | Base Model | 2023-04 | 1.5T tokens |
| StableLM-Tuned-Alpha [github] | 7B | StableLM | 2023-04 | - |
| RWKV [github] | 169M, 430M, 1.5B, 3B, 7B, 14B | Base Model | - | 825GB |
| ChatRWKV [github] | 7B, 14B | RWKV | 2022-12 | - |
| moss-moon-003-base [github] | 16B | Base Model | 2023-04 | 700B tokens |
| moss-moon-003-sft [github] | 16B | moss-moon-003-base | 2023-04 | 1.1M |
| RedPajama-INCITE | 3B, 7B | Base Model | 2023-05 | 1.2T tokens |
| MPT-7B [github] | 7B | Base Model | 2023-05 | 1T tokens |
| MPT-7B-Chat [github] | 7B | MPT-7B | 2023-05 | - |
| Falcon LLM | 7B, 40B | Base Model | 2023-06 | 1T tokens |
| InternLM [github] | 7B | Base Model | 2023-06 | trillions of tokens |
| InternLM Chat [github] | 7B | InternLM | 2023-06 | - |
| Baichuan [github] | 7B | Base Model | 2023-06 | 1.2T tokens |
| Llama 2 [github] | 7B, 13B, 70B | Base Model | 2023-07 | 2T tokens |
| Llama 2-Chat [github] | 7B, 13B, 70B | Llama 2 | 2023-07 | 27,540 instruction-tuning samples, 2,919,326 human preference comparisons |
| Qwen [github] | 7B | Base Model | 2023-08 | 2.2T tokens |
| Qwen-Chat [github] | 7B | Qwen | 2023-08 | - |
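Most checkpoints in Table 1 can be tried locally with a few lines of code. Below is a minimal inference sketch, assuming the Hugging Face `transformers` and `accelerate` libraries and enough GPU (or CPU) memory for a ~7B model; the checkpoint name is illustrative and can be swapped for any of the Huggingface models linked above.

```python
# Minimal inference sketch for a ~7B open model (illustrative checkpoint name;
# replace with any Huggingface model listed in Table 1).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_7b"  # assumption: illustrative checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision so the model fits on a single GPU
    device_map="auto",          # requires `accelerate`; places layers automatically
)

prompt = "Explain in one sentence why relatively small open LLMs are useful:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```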
- C4 (https://www.tensorflow.org/datasets/catalog/c4), mC4 (https://www.tensorflow.org/datasets/catalog/c4#c4multilingual_nights_stay)
- The Pile (https://pile.eleuther.ai/)
- ROOTS corpus
- xP3 (extended from P3) (https://huggingface.co/datasets/bigscience/xP3)
- BooksCorpus
- English CommonCrawl, Github, Wikipedia, Gutenberg Books3, Stack Exchange
- Quora, StackOverflow, MedQuAD
- ShareGPT (https://sharegpt.com), HC3 (https://huggingface.co/datasets/Hello-SimpleAI/HC3)
- Stanford's Alpaca (https://huggingface.co/datasets/tatsu-lab/alpaca), Nomic-AI's gpt4all, Databricks labs' Dolly, and Anthropic's HH
- UltraChat
- OIG (https://laion.ai/blog/oig-dataset/) dataset
- BELLE's Chinese Dataset (https://github.com/LianjiaTech/BELLE/tree/main/data/1.5M)
- RedPajama-Data
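As a quick way to inspect the pre-training corpora listed above without a full download, the sketch below streams a few documents from C4. It assumes the Hugging Face `datasets` library; the Hub identifier `allenai/c4` is an assumption about the hosted mirror and may change over time.

```python
# Stream a handful of C4 documents (mC4 works the same way with a language config).
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:200])  # first 200 characters of each document
    if i >= 2:
        break
```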
Related papers (model quantization for efficient deployment)
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arxiv 2022. [paper]
- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. arxiv 2022. [paper]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022. [paper]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. [paper]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023. [paper]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arxiv 2023. [paper]
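To make the practical side of these quantization methods concrete, the sketch below loads a model with 8-bit weights in the spirit of LLM.int8(). It assumes a recent version of the Hugging Face `transformers` library together with `accelerate` and `bitsandbytes`; the checkpoint name is illustrative, and methods such as GPTQ or SmoothQuant require their own tooling rather than this flag.

```python
# 8-bit weight-only inference sketch (LLM.int8()-style, via bitsandbytes).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-6.7b"  # assumption: illustrative open checkpoint

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # int8 weights with fp16 outlier handling

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # requires `accelerate`
)

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```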
Related papers (parameter-efficient fine-tuning)
- Parameter-Efficient Transfer Learning for NLP. ICML 2019. [paper]
- LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. [paper]
- The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021. [paper]
- GPT Understands, Too. arxiv 2021. [paper]
- Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021. [paper]
- P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. ACL 2022. [paper]
- QLoRA: Efficient Finetuning of Quantized LLMs. arxiv 2023. [paper]
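The sketch below shows what LoRA-style parameter-efficient fine-tuning typically looks like in practice, assuming the Hugging Face `peft` and `transformers` libraries; the base checkpoint and the attention-projection module names are illustrative and differ across architectures. QLoRA follows the same pattern but additionally loads the base model with 4-bit quantization before attaching the adapters.

```python
# LoRA sketch: freeze the base model and train low-rank adapters only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # assumption: illustrative base

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=16,        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in the LoRA paper
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# `model` can now be passed to a standard transformers Trainer or training loop.
```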
Table 5. Overview of open-source efforts and tool development
Coming soon ...
Table 16. ChatGPT Alternatives on Different Applications
If you find our paper or repository useful, please cite our paper:
@misc{gao2023examining,
title={Examining User-Friendly and Open-Sourced Large GPT Models: A Survey on Language, Multimodal, and Scientific GPT Models},
author={Kaiyuan Gao and Sunan He and Zhenyu He and Jiacheng Lin and QiZhi Pei and Jie Shao and Wei Zhang},
year={2023},
eprint={2308.14149},
archivePrefix={arXiv},
primaryClass={cs.CL}
}