Goldfish: Monolingual Language Models for 350 Languages

Goldfish is a suite of 1154 monolingual language models trained for 350 languages. The models are trained on 5MB, 10MB, 100MB, and 1GB of text in each language when available, after accounting for the fact that some languages require more UTF-8 bytes than others to encode comparable text. When 1GB of text is not available for a language, we also release a "full" model trained on our entire dataset for that language. The Goldfish models reach lower perplexities than state-of-the-art multilingual models for many low-resource languages (Chang et al., 2024), and they can be used as baselines, fine-tuning sources, or augmentations to larger models for low-resource NLP research. A Google Colab demo is available here (no technical background required!).

Figure: map of the 350 languages covered by Goldfish.

For training and evaluation details, see our paper, Goldfish: Monolingual Language Models for 350 Languages (Chang et al., 2024). This repository includes the original training and evaluation code, dataset and evaluation info (data directory), and model details (model_details.json).

To use the Goldfish models, we recommend the versions hosted on Hugging Face: https://huggingface.co/goldfish-models
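
As a minimal sketch of loading a Goldfish model with the transformers library: the repo name below assumes the naming convention language code, script, and dataset size (e.g., eng_latn_1000mb for English, Latin script, 1GB of training text); check the goldfish-models organization page linked above for the exact model names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo name format: <language>_<script>_<size>; see the
# goldfish-models Hugging Face page for the exact names available.
model_name = "goldfish-models/eng_latn_1000mb"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # at most 125M parameters
```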

We provide sample code in example_generate_text.py and example_score_text.py, or in a browser through this Google Colab. Each Goldfish model has at most 125M parameters, so it can easily be run on a free Google Colab GPU or even on CPU.
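
The sketch below mirrors what example_generate_text.py and example_score_text.py do (generate a continuation, and score a string by its token-level log-likelihood); it may differ from the repository scripts in details such as decoding settings, and the model name is the same assumption as above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "goldfish-models/eng_latn_1000mb"  # assumed repo name; see note above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Generate a short continuation of a prompt.
inputs = tokenizer("The little goldfish", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Score a sentence: mean negative log-likelihood per token (lower = better fit).
text = "The little goldfish swam around the bowl."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"Mean negative log-likelihood per token: {loss.item():.3f}")
```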

Citation.

@article{chang-etal-2024-goldfish,
  title={Goldfish: Monolingual Language Models for 350 Languages},
  author={Chang, Tyler A. and Arnett, Catherine and Tu, Zhuowen and Bergen, Benjamin K.},
  journal={Preprint},
  year={2024},
  url={https://www.arxiv.org/abs/2408.10441},
}
