Goldfish is a suite of 1154 monolingual language models trained for 350 languages. The models are trained on 5MB, 10MB, 100MB, and 1GB of text in each language when available, after accounting for the fact that some languages require more UTF-8 bytes than others to encode comparable text. When 1GB of text is not available for a language, we also release a "full" model trained on our entire dataset for that language. The Goldfish models reach lower perplexities than state-of-the-art multilingual models for many low-resource languages (Chang et al., 2024), and they can be used as baselines, fine-tuning sources, or augmentations to larger models for low-resource NLP research. A Google Colab demo is available (no technical background required!).
For training and evaluation details, see our paper, Goldfish: Monolingual Language Models for 350 Languages (Chang et al., 2024).
This repository includes the original training and evaluation code, dataset and evaluation info (data directory), and model details (model_details.json).
To use the Goldfish models, we recommend using the models available on Hugging Face: https://huggingface.co/goldfish-models
We provide sample code in example_generate_text.py and example_score_text.py, or you can run the same workflow in a browser through the Google Colab demo.
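As a rough illustration of what example_generate_text.py does, here is a minimal sketch of loading a Goldfish model with the Hugging Face transformers library and generating text. The model ID used below is an assumed example; check the Hugging Face organization page for the exact ID of the language and size you need.

```python
# Minimal sketch: load a Goldfish model from Hugging Face and generate text.
# "goldfish-models/eng_latn_1000mb" is an assumed example model ID; see
# https://huggingface.co/goldfish-models for the actual model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "goldfish-models/eng_latn_1000mb"  # assumed example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The goldfish swam"
inputs = tokenizer(prompt, return_tensors="pt")
# Sample a short continuation; adjust decoding parameters as needed.
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```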
Each Goldfish model has at most 125M parameters, so it can easily be run on a free Google Colab GPU or even on a CPU.
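In the same spirit as example_score_text.py (this mirrors its intent, not necessarily its exact implementation), the sketch below scores a piece of text by its mean per-token negative log-likelihood under a Goldfish model; given the small model size, this runs comfortably on CPU.

```python
# Minimal sketch: score text with a Goldfish model (mean per-token NLL and perplexity).
# The model ID is again an assumed example; replace it with the language/size you need.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "goldfish-models/eng_latn_1000mb"  # assumed example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "The goldfish swam around the bowl."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Mean NLL per token: {loss.item():.3f}  (perplexity: {torch.exp(loss).item():.1f})")
```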
Citation:

@article{chang-etal-2024-goldfish,
  title={Goldfish: Monolingual Language Models for 350 Languages},
  author={Chang, Tyler A. and Arnett, Catherine and Tu, Zhuowen and Bergen, Benjamin K.},
  journal={Preprint},
  year={2024},
  url={https://www.arxiv.org/abs/2408.10441},
}