Goldfish is a suite of 1154 monolingual language models trained for 350 languages. The models are trained on 5MB, 10MB, 100MB, and 1GB of text in each language when available, after accounting for the fact that some languages require more UTF-8 bytes than others to encode comparable text. When 1GB of text is not available for a language, we also release a "full" model trained on our entire dataset for that language. The Goldfish models reach lower perplexities than state-of-the-art multilingual models for many low-resource languages (Chang et al., 2024), and they can be used as baselines, fine-tuning sources, or augmentations to larger models for low-resource NLP research. A Google Colab demo is available (no technical background required!).
For training and evaluation details, see our paper, Goldfish: Monolingual Language Models for 350 Languages (Chang et al., 2024).
This repository includes the original training and evaluation code, dataset and evaluation info (data directory), and model details (model_details.json).
To use the Goldfish models, we recommend using the models available on Hugging Face: https://huggingface.co/goldfish-models
We provide sample code in example_generate_text.py and example_score_text.py, or you can run the same workflow in a browser through the Google Colab demo.
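As a rough illustration of what example_generate_text.py does, here is a minimal sketch of loading a Goldfish model with the Hugging Face transformers library and generating text. The model ID used below is an assumed example; check the Hugging Face organization page for the exact ID of the language and size you need.

```python
# Minimal sketch: load a Goldfish model from Hugging Face and generate text.
# "goldfish-models/eng_latn_1000mb" is an assumed example model ID; see
# https://huggingface.co/goldfish-models for the actual model names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "goldfish-models/eng_latn_1000mb"  # assumed example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The goldfish swam"
inputs = tokenizer(prompt, return_tensors="pt")
# Sample a short continuation; adjust decoding parameters as needed.
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```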
Each Goldfish model has at most 125M parameters, so it can easily be run on a free Google Colab GPU or even on a CPU.
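In the same spirit as example_score_text.py (this mirrors its intent, not necessarily its exact implementation), the sketch below scores a piece of text by its mean per-token negative log-likelihood under a Goldfish model; given the small model size, this runs comfortably on CPU.

```python
# Minimal sketch: score text with a Goldfish model (mean per-token NLL and perplexity).
# The model ID is again an assumed example; replace it with the language/size you need.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "goldfish-models/eng_latn_1000mb"  # assumed example ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "The goldfish swam around the bowl."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Mean NLL per token: {loss.item():.3f}  (perplexity: {torch.exp(loss).item():.1f})")
```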
Citation:

@article{chang-etal-2024-goldfish,
  title={Goldfish: Monolingual Language Models for 350 Languages},
  author={Chang, Tyler A. and Arnett, Catherine and Tu, Zhuowen and Bergen, Benjamin K.},
  journal={Preprint},
  year={2024},
  url={https://www.arxiv.org/abs/2408.10441},
}