Do Not Worry If You Do Not Have Data

{meetdoshi,pb}@[Link]   {prajdabre}@[Link]
NICT, Japan

In this paper, we explore the utility of Translationese as synthetic data created using machine translation for training language models.
3 Methodology

In this section, we describe our framework for leveraging synthetic data for LM training. This consists of monolingual data curation from the web (clean), training a TinyLM with it, translation of the clean data, using the aforementioned TinyLM to filter the synthetic data, and then using this filtered data for training a larger LM to be used for downstream tasks. Our framework is described in Figure 2.
3.1 Monolingual Data

Figure 3: Language-wise corpora size comparison with IndicCorpv2 (Doddapaneni et al., 2023): Stacked Bars.

Web Crawled (Clean): Following Doddapaneni et al. (2023); Rae et al. (2022); Team et al. (2022), for all languages of interest, we a. obtain a list of URLs to be crawled via word-level n-grams passed to a search engine, b. after URL deduplication, crawl all applicable webpages, c. automatically and manually (Ortiz Suárez et al., 2019; Abadji et al., 2022) filter out unwanted text like HTML tags and emoticons, d. use language-detection-based (LID) filtering via cld3³ and the IndicLID-FTN model (Madhani et al., 2023a) to discard languages not of interest, e. perform document filtering to remove offensive text using the toxic words list provided by Team et al. (2022), f. merge the filtered corpus with Wikipedia, OSCAR (Ortiz Suárez et al., 2019) and some dumps of mC4 (Xue et al., 2021), and finally, g. perform deduplication at the paragraph level using the MurmurHash algorithm⁴ with a 128-bit unsigned hash for each monolingual split of the corpora.

³ [Link]
⁴ [Link]
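Step g can be sketched as follows. This is a minimal illustration assuming the `mmh3` Python bindings for MurmurHash, not the authors' released pipeline; the document/paragraph layout is a simplification.

```python
# Paragraph-level deduplication with a 128-bit unsigned MurmurHash (step g).
# Assumes the `mmh3` library; documents are newline-separated paragraphs.
import mmh3

def dedup_paragraphs(documents):
    """Drop any paragraph whose 128-bit hash has already been seen."""
    seen = set()
    deduped_docs = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            h = mmh3.hash128(para.strip(), signed=False)  # 128-bit unsigned hash
            if h not in seen:
                seen.add(h)
                kept.append(para)
        if kept:
            deduped_docs.append("\n".join(kept))
    return deduped_docs
```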
Translationese (Synthetic): We utilize state-of-the-art MT models like IndicTrans2 (Gala et al., 2023) to generate translationese data. We use beam search with a beam value of 5 to translate English tokens from the aforementioned crawled corpus to the languages of interest. Most MT models have a maximum token limit, and thus we split the documents using the Moses Sentence Splitter⁵, perform translation into the target language at the sentence level, and then merge the translations again to form documents. Our experiments also focus on synthetic English data translated from Hindi and Gujarati.

⁵ [Link]
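The split-translate-merge step can be sketched as below. Here `translate_batch` is a hypothetical stand-in for an IndicTrans2-style model decoding with beam size 5, and the `sentence_splitter` package implements Moses-style sentence splitting; this is not the authors' exact code.

```python
# Sketch of document-level translation via sentence splitting and merging.
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language="en")

def translate_document(doc: str, translate_batch) -> str:
    """Split a document into sentences to stay under the MT token limit,
    translate each sentence, and merge back into one document."""
    sentences = splitter.split(doc)
    translations = translate_batch(sentences)  # e.g. beam search with beam=5
    return " ".join(translations)
```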
Figure 4: The plot illustrates TinyLM’s perplexity mean and variance across various datasets: Clean-EN (left),
Syn-EN from filtered Hindi (middle), and Syn-EN from filtered Gujarati (right). Despite filtering, English documents
generated from translating Gujarati show consistently higher variance.
3.2 Tiny Language Models (TinyLMs)

TinyLMs are simply tiny versions of language models, inspired by Eldan and Li (2023). We follow the Transformer architecture (Vaswani et al., 2017) used by Eldan and Li (2023) and train it using only clean monolingual documents. Instead of learned positional encodings, we use RoPE embeddings (Su et al., 2023) for better extrapolation to longer documents. We rely on the Chinchilla scaling laws (Hoffmann et al., 2022) and use a compute-optimal number of word tokens to train our models.
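For reference, one common formulation of rotary embeddings is sketched below; it rotates pairs of query/key channels by a position-dependent angle. This is an illustration of the technique, not the authors' implementation (per Appendix B, their models rotate half of each head's dimensions).

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embeddings (Su et al., 2023), rotate-half variant.
    x: (batch, n_heads, seq_len, rot_dim) slice of the query or key heads."""
    b, h, t, d = x.shape
    half = d // 2
    inv_freq = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos()[None, None], angles.sin()[None, None]  # (1, 1, t, half)
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) channel pair by its position-dependent angle
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
```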
3.3 Synthetic Data Filtering

We first train TinyLMs on crawled data and then use them to compute perplexities on our synthetic documents $W = w_1, w_2, \ldots, w_N$ using the equation:

$$\mathrm{PPL}(W) = \exp\left\{-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})\right\}$$

We also use our TinyLM to repair sentences in documents that exceed the maximum length of the MT models, but this happens in only 0.002% of such cases. We provide more details of our approach in Appendix C.
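A minimal sketch of this filtering step, assuming a Hugging Face-style causal LM interface (the authors' TinyLMs are custom models, and the threshold here is purely illustrative):

```python
import torch

@torch.no_grad()
def doc_perplexity(model, tokenizer, text: str, max_len: int = 4096) -> float:
    """Mean-NLL perplexity of one document under a TinyLM."""
    ids = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=max_len).input_ids
    out = model(ids, labels=ids)      # labels are shifted internally
    return torch.exp(out.loss).item()

def filter_corpus(model, tokenizer, docs, threshold: float):
    """Keep only synthetic documents the TinyLM finds sufficiently predictable."""
    return [d for d in docs if doc_perplexity(model, tokenizer, d) < threshold]
```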
3.4 Final LM and Downstream Task Training

Using the filtered synthetic corpora, we train our final LMs, which are comparatively larger, and fine-tune them for natural language understanding (NLU) tasks such as IndicGLUE (Kakwani et al., 2020) and GLUE (Wang et al., 2018), and for generation (NLG) benchmarks such as IndicNLG (Kumar et al., 2022), summarization tasks (Nallapati et al., 2016; Chen et al., 2021) and machine translation benchmarks (Team et al., 2022; Gala et al., 2023).
4 IndicMonoDoc

Following the monolingual data curation strategy in Section 3, we crawl data for English and 22 Indic languages. As a result, we end up with IndicMonoDoc, with 27.5 billion tokens of Indic documents and 12 billion tokens of English documents, for a total of 39.5 billion tokens of clean monolingual data. This is the largest such corpus ever for Indic languages, surpassing that of Doddapaneni et al. (2023) by 2 times. We use IndicMonoDoc for all clean parts of our experiments. We report additional details of IndicMonoDoc in Appendix D.

4.1 Analysis of Crawled Corpora

Figure 3 gives an overview of the comparison of IndicMonoDoc, which is a document-level corpus, with IndicCorpV2, which is a sentence-level corpus. It is important to note that we paid special attention to the low-resource languages.

4.2 Analysis of Synthetic Data

We use IndicMonoDoc for the clean part of our experiments and translate parts of it for the synthetic experiments. Figure 4 shows the perplexity mean and variance scores for TinyLM across token positions in the documents. This shows that on unseen documents, TinyLM shows higher variance on English documents generated by translating Gujarati documents from IndicMonoDoc, as compared to clean English and synthetic English generated from Hindi. This also gives a reason for the deterioration in results in Table 4 due to Gujarati documents. Figure 5 shows the distribution of lengths of documents filtered by TinyLMs, showing that filtering does not add any bias towards shorter documents. Although we do not experiment with all these languages, we believe that IndicMonoDoc will be an invaluable resource for Indic LMs.

Figure 5: Violin plot displaying the distribution of lengths of clean and filtered English documents on different data splits: en-clean (English web documents), syn-en_hi (synthetic English documents translated from Hindi), and syn-en_gu (synthetic English documents translated from Gujarati).
5 Experiments

In this section, we describe the training procedure and datasets used for the different models mentioned in Section 3. We pretrain and fine-tune all of the mentioned models from scratch in both mono- and bilingual settings, using a Causal Language Modeling (CLM) objective for NLG tasks and a linear classification head for all classification tasks. We specify the sample of the dataset used for pretraining and finetuning for each model and examine the different effects of using synthetic corpora for pretraining.

5.1 Pretraining Data Settings

We refer to translated text or translationese as synthetic or syn, and to original or web-crawled data as clean, throughout our experiments. For the pretraining of all base models, we use the following naming convention to denote our training splits for each model:

XX-clean: A clean subset sampled randomly from IndicMonoDoc, where XX represents the language English (EN), Hindi (HI) or Gujarati (GU).
syn-XX_yy-unfiltered: Synthetic monolingual documents in language XX generated using yy as the source during translation.
syn-XX_yy-filtered: Filtered synthetic data.
+10%: Extended pretraining on a cleaned subset of IndicMonoDoc with an additional 10% tokens compared to regular training.
BI-XX-YY Prefix: Bilingual models trained using an equal mixture of monolingual corpora in the XX and YY languages. We append an _syn suffix to either XX or YY if a synthetic version of that language is employed in training, and a -parallel/nonparallel tag to denote whether parallel versions of XX and YY are used or not.

Note that for each split we only use the number of tokens required to reach the point of optimality (Hoffmann et al., 2022) for the language model. We mention other details in Appendix B.
5.2 Implementation and Training

Tokenizer: We use a common byte-pair-encoding (BPE) (Sennrich et al., 2016b) tokenizer built with Sentencepiece⁶ for all experiments. We train a shared vocabulary of 56k subwords across three languages, English, Hindi, and Gujarati, using 5 million randomly sampled sentences per language, with upsampling for Gujarati.

⁶ [Link]
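A sketch of this tokenizer training with the sentencepiece API is shown below; the input file name is a hypothetical placeholder for the sampled (and Gujarati-upsampled) sentences.

```python
# Train a shared 56k BPE vocabulary over English, Hindi, and Gujarati.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="en_hi_gu_sampled.txt",  # hypothetical path: 5M sentences/language
    model_prefix="shared_bpe",
    vocab_size=56000,
    model_type="bpe",
    character_coverage=1.0,        # keep the Indic scripts fully covered
)
```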
TinyLMs: We use PyTorch Lightning⁷ for our implementations and train TinyLMs as described in Section 3.2. We use hidden sizes of 768 and have two variants, one with 4 layers (mini) and one with 12 layers (base; same as GPT2-base), with 28M and 85M non-embedding parameters respectively. The mini models are trained on clean data with sequence lengths of 4096⁸ (mini-4k) for filtering synthetic documents as described in Section 3.3. On the other hand, for our main pre-training and downstream fine-tuning experiments, we train mini and base models with sequence lengths of 1024 (mini-1k and base-1k). Following Hoffmann et al. (2022), we use 2.4 billion word tokens per language for compute-optimal training of the base models. Since Gujarati has only 900M tokens in our dataset, whenever Gujarati is involved, we train only the mini-1k model. For models involving English and Hindi, we train both mini and base models. Additional details are in Appendix B.

⁷ [Link]
⁸ We keep long sequence lengths to be able to handle long documents for filtering.
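For intuition, the 2.4B figure is of the order one obtains from the roughly 20-tokens-per-parameter heuristic often read off Hoffmann et al. (2022). The back-of-the-envelope estimate below is ours, not the authors':

```python
# Our own rough check (not from the paper): base model = 85M non-embedding
# parameters plus a 56k x 768 embedding matrix, times ~20 tokens/parameter.
non_embedding = 85e6
embedding = 56_000 * 768            # ~43M embedding parameters
total_params = non_embedding + embedding
tokens = 20 * total_params          # Chinchilla-style heuristic
print(f"{tokens / 1e9:.1f}B tokens")  # ~2.6B, the same order as the 2.4B used
```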
(a) Results on Hindi

Model                    | iXNLI | bbc-a | iitp-mr | iitp-pr | midas | NLU Avg. | Headline Gen. | Sentence Summ. | Question Gen. | Wikibio | NLG Avg.
HI-clean                 | 73.61 | 81.75 |   72.58 |   79.73 | 80.34 |    77.60 |         27.54 |          23.64 |         24.84 |   52.16 |    32.04
syn-HI_en-unfiltered     | 72.87 | 77.92 |   64.36 |   76.22 | 79.91 |    74.26 |         27.29 |          22.93 |         24.22 |   50.14 |    31.14
syn-HI_en-unfiltered+10% | 74.63 | 78.36 |   67.75 |   77.46 | 80.17 |    75.67 |             - |              - |             - |       - |        -
syn-HI_en-filtered       | 74.75 | 81.06 |   69.03 |   78.58 | 79.73 |    76.63 |         27.15 |          23.10 |         24.41 |   49.88 |    31.13
syn-HI_en-filtered+10%   | 74.49 | 80.94 |   71.61 |   79.92 | 80.64 |    77.52 |             - |              - |             - |       - |        -

Table 1: Results for Hindi and Gujarati: NLU/NLG tasks on base-1k (Hindi) and mini-1k (Gujarati) models on different clean and synthetic splits. Test accuracy for NLU tasks; Rouge-L F1 scores for NLG tasks. Bold values represent the best amongst synthetic splits.
5.3 Downstream Tasks and Evaluation

We finetune the mini-1k and base-1k models for various classification, regression, and generation tasks. We do some hyperparameter tuning for each task and then reuse those settings across the different data splits. More hyperparameter and evaluation details can be found in Appendix B. For evaluations, we report our primary scores on IndicGLUE (Kakwani et al., 2020) and IndicXNLI (iXNLI) (Aggarwal et al., 2022) for Hindi and Gujarati, and use the validation set of the GLUE benchmark (Wang et al., 2018) for English. We also experiment with other generation tasks like CNN-Dailymail (Nallapati et al., 2016), DialogSum (Chen et al., 2021), XL-Sum (Hasan et al., 2021), IndicNLG⁹ (Kumar et al., 2022), FLoRes-200 (Team et al., 2022), and IN22-Conv & IN22-Gen (Gala et al., 2023), and use standard evaluation metrics suitable for each task, like accuracy, F1-score, Rouge-L (Lin, 2004) and chrF++ (Popović, 2017).

⁹ We only take the first 4k examples of the IndicNLG test split for each task due to the large test split of IndicNLG.
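For classification and regression, a single linear head is placed on top of the pretrained causal LM (Appendix B.3). The sketch below assumes a backbone that returns per-token hidden states and classifies from the last token; the class and interface names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class CLMForClassification(nn.Module):
    """Pretrained causal LM backbone + one linear classification head."""
    def __init__(self, backbone: nn.Module, hidden_size: int = 768, n_classes: int = 2):
        super().__init__()
        self.backbone = backbone              # assumed to return (B, T, hidden_size)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)     # (B, T, H)
        return self.head(hidden[:, -1])       # classify from the last token

# Fine-tuning setup reported in Appendix B.3 (batch size 48):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```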
6 Results

We now present our results, which help establish the utility of synthetic data for language modeling.

6.1 Main Results

In this section, we present results for Hindi, Gujarati, and English language models trained on clean data, as well as on synthetic data generated from translations. We demonstrate the impact of filtering and of adding additional clean data for extended pretraining of LMs trained solely on synthetic text. Additionally, we observe the effect of using the clean source text along with its translations (synthetic parallel documents) on downstream tasks. We follow the naming convention for the different data splits as specified in Section 5. We provide details for the pretraining of each model in Appendix B.

Filtered Synthetic Data is Competitive with Web Scraped Data: The results in Tables 1 and 2 indicate that syn-HI_en-unfiltered, syn-GU_en-unfiltered, and syn-EN_hi-unfiltered exhibit lower downstream performance compared to their filtered counterparts: syn-HI_en-filtered, syn-GU_en-filtered, and syn-EN_hi-filtered, respectively. It is evident that filtering the synthetic documents using TinyLMs significantly improves performance on both NLU and NLG tasks. In Table 2, we observe that for tasks like CoLA (Warstadt et al., 2019), language models trained solely on synthetic data lag behind relative to other tasks. This suggests that synthetic corpora may lack certain elements necessary for language models to perform competitively on linguistic acceptability tasks, as opposed to LMs trained on clean, non-synthetic corpora.
     Model                             | sst2 (acc) | cola (mcc) | mrpc (f1) | qnli (acc) | qqp (f1) | rte (acc) | mnli-m (acc) | mnli-mm (acc) | stsb (pearson) | Avg.
Mono EN-clean                          |      90.94 |      40.26 |     87.40 |      84.98 |    84.47 |     65.34 |        77.84 |         77.96 |          82.67 | 76.87
     syn-EN_hi-unfiltered              |      84.61 |      31.10 |     81.78 |      79.35 |    81.44 |     63.30 |        72.94 |         73.16 |          78.90 | 71.84
     syn-EN_hi-unfiltered + 10%        |      87.39 |      34.22 |     85.77 |      80.96 |    81.07 |     65.11 |        74.76 |         74.38 |          80.32 | 73.78
     syn-EN_hi-filtered                |      88.30 |      34.03 |     86.55 |      83.59 |    83.64 |     63.17 |        75.60 |         75.41 |          81.10 | 74.60
     syn-EN_hi-filtered + 10%          |      90.13 |      35.75 |     86.41 |      84.75 |    84.21 |     65.34 |        76.99 |         76.91 |          81.95 | 75.83
Bi   BI-EN-HI-clean                    |      89.56 |      38.53 |     85.56 |      84.88 |    84.39 |     64.25 |        76.40 |         77.27 |          82.07 | 75.88
     BI-EN-HI_syn-parallel-filtered    |      89.56 |      39.57 |     85.71 |      84.75 |    84.62 |     64.98 |        77.31 |         77.85 |          82.41 | 76.31
     BI-EN-HI_syn-nonparallel-filtered |      89.79 |      38.68 |     86.92 |      85.08 |    84.06 |     65.34 |        77.15 |         77.55 |          83.01 | 76.40
     BI-EN_syn-HI_syn-filtered         |      87.95 |      30.05 |     84.90 |      83.70 |    83.97 |     63.89 |        75.63 |         76.24 |          82.24 | 74.29
     BI-EN_syn-HI_syn-filtered + 10%   |      89.10 |      35.45 |     85.34 |      84.53 |    84.18 |     65.70 |        76.64 |         77.24 |          82.10 | 75.59

Table 2: Results on English: Dev set of GLUE tasks for different synthetic splits on the base-1k model. Synthetic LMs perform almost as well as clean LMs after filtering and further training with clean data. Bold values represent the best amongst synthetic splits.
Fine-tuning on Web Scraped Data boosts performance: Even after filtering, we observe that language models trained solely on synthetic text slightly underperform LMs trained on clean text. To address this issue, we conduct extended pretraining of LMs using clean data sourced from IndicMonoDoc. The objective is to determine whether this additional training improves performance. We only incorporate an additional 10% of clean data compared to the LM's previous training data. We see these results across all three languages; for Hindi and Gujarati, by incorporating even this small amount of clean data, we observe an increase in performance on downstream tasks, bringing the LM at par with or closer to the performance of a clean LM. We see an improvement in LMs trained using unfiltered synthetic corpora as well, but we believe that filtering leads to the removal of noisy documents and thus better performance.

Model                  | iXNLI | bbc-a | iitp-mr | iitp-pr | midas | Avg.
HI-clean               | 68.74 | 80.25 |   67.74 |   77.05 | 78.33 | 74.42
syn-HI_en-unfiltered   | 67.32 | 77.92 |   65.63 |   76.81 | 77.58 | 73.05
syn-HI_en-filtered     | 69.48 | 78.98 |   65.16 |   77.43 | 77.33 | 73.68
syn-HI_en-filtered+10% | 70.15 | 79.56 |   67.09 |   78.20 | 79.03 | 74.81

Table 3: Effect of reducing model size for Hindi on IndicGLUE accuracy. All the results reported here are on mini-1k. Bold values represent the best amongst synthetic splits.

Using synthetic data for one language doesn't impact performance in another: For many multilingual language models, data imbalance causes a gap in performance across languages. But what if we could combine synthetic data with clean data for training multilingual models; would the synthetic part deteriorate the performance of the multilingual model? To experiment with this, we train bilingual base-1k models over different combinations of clean and synthetic corpora for English and Hindi, evaluate their performance on GLUE (Wang et al., 2018), and report performance on IndicNLG and machine translation in Appendix A. Following Table 2, we see that using Hindi synthetic data does not affect English performance compared to the BI-EN-HI-clean model, which is trained solely on clean corpora. This implies that it is possible to train multilingual models where some languages are trained only on a clean subset and others on synthetic data, without deteriorating performance across languages. We further see that using parallel data does not have much impact on multilingual models.

6.2 Further Exploration

Impact of source language for synthetic data generation: Choosing the right source language for synthetic corpora is crucial, as it influences the characteristics of the generated translationese text. We evaluate this using Hindi and Gujarati clean documents from IndicMonoDoc, translating them into English. Since Gujarati has limited data (900M tokens), we train a mini-1k model for a fair comparison. In Table 4, we see that the synthetic text generated from Hindi achieves performance at par with the EN-clean model, while the synthetic text from Gujarati significantly lags behind. This is likely because Hindi is more macaronic than Gujarati, i.e., a lot of Hindi text from the web consists of Hinglish, resulting in better translationese text due to increased overlap between the languages. This can also be due to weaker translations generated by the MT model. The performance gap is notable in tasks like the STS benchmark, NLI (qnli and mnli), and CoLA, suggesting poorer translation quality for Gu→En compared to Hi→En.

Impact of model size: Following Tables 4 and 3, we see that even after scaling down, there are consistent improvements from filtering and from adding additional data, which empirically shows that using synthetic text after filtering is indeed a viable option for pretraining LMs of varying sizes. In Table 3 we see that after filtering and extended pretraining, synthetic text outperforms LMs trained on clean documents from the web in Hindi.
               Model                    | sst2 (acc) | cola (mcc) | mrpc (f1) | qnli (acc) | qqp (f1) | rte (acc) | mnli-m (acc) | mnli-mm (acc) | stsb (pearson) | Avg.
Original       EN-clean                 |      87.95 |      25.59 |     83.84 |      78.83 |    80.78 |     64.62 |        71.60 |         71.69 |          73.48 | 70.93
Translationese syn-EN_hi-unfiltered     |      87.53 |      19.77 |     79.02 |      76.49 |    77.96 |     55.40 |        69.65 |         70.14 |          67.37 | 67.04
Hi->En         syn-EN_hi-filtered       |      87.61 |      22.81 |     81.95 |      77.63 |    80.57 |     56.31 |        70.19 |         70.89 |          69.29 | 68.58
               syn-EN_hi-filtered + 10% |      87.84 |      26.61 |     83.27 |      78.50 |    80.36 |     61.37 |        71.29 |         71.11 |          71.91 | 70.25
Translationese syn-EN_gu-unfiltered     |      83.11 |      17.66 |     78.53 |      66.01 |    77.68 |     53.60 |        63.21 |         64.55 |          27.33 | 59.08
Gu->En         syn-EN_gu-filtered       |      85.66 |      21.15 |     81.45 |      66.35 |    77.36 |     54.15 |        66.27 |         65.72 |          26.16 | 60.47
               syn-EN_gu-filtered + 10% |      86.58 |      25.17 |     81.67 |      67.10 |    77.75 |     57.76 |        68.78 |         68.56 |          27.54 | 62.32

Table 4: Effect of source selection for generating synthetic data on the dev set of the GLUE benchmark. All the results reported here are on mini-1k. Bold values represent the best amongst synthetic splits.
Model                | Cnn   | Dialogsum | XLSum HG | XLSum QG | Avg.
EN-clean             | 23.87 |     24.05 |    16.08 |    20.39 | 21.10
syn-EN_hi-unfiltered | 22.17 |     22.97 |    12.56 |    18.30 | 19.00
syn-EN_hi-filtered   | 23.27 |     23.83 |    15.88 |    19.83 | 20.70

Table 5: Performance of English models on NLG tasks. All the results reported here are on base-1k and use Rouge-L F1 scores.

Impact on NLG: Without extended pretraining, language models trained on synthetic text perform as well as those trained on clean documents, suggesting that for NLG tasks, synthetic data suffices for pretraining, eliminating the need for clean data. This trend is evident across the Hindi, Gujarati, and English NLG results (Tables 1 and 5). As their performance matches that of models trained on clean data, we refrain from extended pretraining for NLG tasks, focusing primarily on abstractive summarization for evaluating generation capabilities.

Model                             | EN-HI | HI-EN | Avg.
BI-EN-HI-clean                    | 46.56 | 51.70 | 49.13
BI-EN-HI_syn-parallel-filtered    | 44.12 | 50.64 | 47.38
BI-EN-HI_syn-nonparallel-filtered | 45.65 | 51.29 | 48.47
                                  | EN-GU | GU-EN | Avg.
BI-EN-GU-clean                    | 26.44 | 35.30 | 30.87
BI-EN-GU_syn-parallel-filtered    | 26.77 | 34.84 | 30.81
BI-EN-GU_syn-nonparallel-filtered | 26.70 | 36.54 | 31.62

Table 6: chrF++ scores on the FLoRes translation task. EN-HI models are based on base-1k and EN-GU models are based on mini-1k.

Impact on Machine Translation (MT): We focus on MT separately as a special case of NLG. We hypothesized that using parallel synthetic documents for bilingual models would improve translation performance by enhancing alignment between languages. However, our evaluation does not support this hypothesis. Results indicate that using nonparallel synthetic documents yields similar translation performance across language directions and benchmarks compared to parallel synthetic documents. This might be because there is no explicit alignment happening during training between parallel documents. See Table 6 for chrF++ scores on FLoRes-200 (Team et al., 2022), and Appendix A for chrF++ and BLEU scores on IN22-Conv and IN22-Gen (Gala et al., 2023).

7 Conclusion

In this paper, we performed a first-of-its-kind study of the feasibility of using translationese data for training language models. We proposed a simple pipeline involving the translation of documents at scale, followed by filtering using small and efficient language models trained on clean data. We then showed, on a variety of downstream natural language understanding and generation tasks, that language models trained on unclean synthetic data were only slightly inferior to those trained on original data; moreover, filtered synthetic data with extended pretraining on clean data mostly eliminates this gap. We also observed a positive impact of synthetic data on TinyLMs fine-tuned on 10% clean data. While we observed that the source language, and its potential content, for synthetic data generation matters, it is clear that synthetic data can help bridge the resource scarcity faced by a vast majority of languages for language modeling. As a part of this work, we also created IndicMonoDoc, the largest collection of clean document-level datasets for 22 Indic languages and English, which we release along with our synthetic data, pipelines, and code. In the future, we aim to generate synthetic data at much larger scales and to experiment with large language models to push the boundaries of language modeling for low-resource languages.

Limitations

We consider the following limitations of our work.
• Our work mainly focuses on TinyLMs, so not all observations may carry over to large language models; however, synthetic data generated from translations can surely help fill knowledge gaps.

• We could not experiment with the entire test sets of IndicNLG tasks like Question Generation, WikiBio Generation, Headline Generation, and Sentence Summarization due to their vast test splits, but we do not expect the main trends to change, given that we already use 4000 examples per language.

• For GLUE tasks we report our numbers on the validation set and not on the test set for all models: since the scale of our experiments was large, automatically submitting results to the test set was not feasible. We follow existing works doing the same. Moreover, our goal is not to achieve state-of-the-art results but rather to establish the utility of synthetic data by observing trends.

• We have not manually verified the synthetic data, so despite cleaning using TinyLMs there is a chance that some toxic content or bad documents remain. Addressing this is future work.

• Our framework places significant emphasis on the translation model's performance. Nevertheless, we are confident that this approach will significantly contribute to enhancing the performance of mid-resource languages, particularly those for which the translation model demonstrates considerable proficiency.

Ethical Considerations

As a part of this paper, we release monolingual and synthetic data. While we have taken care to remove any toxic content, accidental occurrences may exist, and thus we advise caution when using our data for training language models, as they may produce toxic outputs. Given that we have shown the utility of synthetic data for training LMs, it should be possible to mass-produce synthetic toxic data in various languages, leading to LMs that can generate multilingual toxic content. However, this opens up research opportunities on how to detect and filter toxic content from synthetically created data.

We aim to release the code and models with an MIT License¹⁰. The dataset will be released under a CC-0 License¹¹.

¹⁰ [Link]
¹¹ [Link] public-domain/cc0/

References

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, and Benoît Sagot. 2022. Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. arXiv e-prints, page arXiv:2201.06642.

Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10994–11006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of open language models.

Nikolay Bogoychev and Rico Sennrich. 2019. Domain, translationese and noise in synthetic data for neural machine translation. CoRR, abs/1911.03362.

Ondrej Bojar, Vojtech Diatka, Pavel Rychlỳ, Pavel Stranák, Vít Suchomel, Ales Tamchyna, and Daniel Zeman. 2014. Hindencorp – hindi-english and hindi-only corpus for machine translation. In LREC, pages 3550–3555.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang. 2021. DialogSum: A real-life scenario dialogue summarization dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 5062–5074, Online. Association for Computational Linguistics.

Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A survey of multilingual neural machine translation. ACM Comput. Surv., 53(5).
Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An open-source library for using and developing summarization evaluation metrics. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 120–125, Online. Association for Computational Linguistics.

Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M Khapra, Anoop Kunchukuttan, and Pratyush Kumar. 2023. Towards leaving no indic language behind: Building monolingual corpora, benchmark and models for indic languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12402–12426.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How small can language models be and still speak coherent english?

Jay Gala, Pranjal A Chitale, A K Raghavan, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar M, Janki Atul Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. Indictrans2: Towards high-quality and accessible machine translation models for all 22 scheduled indian languages. Transactions on Machine Learning Research.

Martin Gellerstam. 1986. Translationese in swedish novels translated from english. Translation studies in Scandinavia, 1:88–95.

Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 759–765, Istanbul, Turkey. European Language Resources Association (ELRA).

Gili Goldin, Ella Rabinovich, and Shuly Wintner. 2018. Native language identification with user generated content. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3591–3601, Brussels, Belgium. Association for Computational Linguistics.

Yvette Graham, Barry Haddow, and Philipp Koehn. 2019. Translationese in machine translation evaluation. CoRR, abs/1906.09833.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. 2021. XL-sum: Large-scale multilingual abstractive summarization for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. [Link].

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, et al. 2023. Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662.

Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh Mishra, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra, and Pratyush Kumar. 2022. IndicNLG benchmark: Multilingual datasets for diverse NLG tasks in Indic languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5363–5394, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. [Link].

Yash Madhani, Mitesh M. Khapra, and Anoop Kunchukuttan. 2023a. Bhasha-abhijnaanam: Native-script and romanized language identification for 22 indic languages.

Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul Nc, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Khapra. 2023b. Aksharantar: Open Indic-language transliteration datasets and models for the next billion users. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 40–57, Singapore. Association for Computational Linguistics.

Benjamin Marie, Raphael Rubino, and Atsushi Fujita. 2020. Tagged back-translation revisited: Why does it really work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5990–5997, Online. Association for Computational Linguistics.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Anthony McEnery, Paul Baker, Robert Gaizauskas, and Hamish Cunningham. 2000. Emille: Building a corpus of south asian languages. In Proceedings of the International Conference on Machine Translation and Multilingual Applications in the new Millennium: MT 2000.

Jihyung Moon, Hyunchang Cho, and Eunjeong L. Park. 2020. Revisiting round-trip translation for quality estimation. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 91–104, Lisboa, Portugal. European Association for Machine Translation.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.

Jingwei Ni, Zhijing Jin, Markus Freitag, Mrinmaya Sachan, and Bernhard Schölkopf. 2022. Original or translated? a causal analysis of the impact of translationese on machine translation performance. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5303–5320, Seattle, United States. Association for Computational Linguistics.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Maja Popović, Alberto Poncelas, Marija Brkic, and Andy Way. 2020. Neural machine translation for translating into Croatian and Serbian. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 102–113, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).

Yiwei Qin, Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2023. T5Score: Discriminative fine-tuning of generative evaluation metrics. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15185–15202, Singapore. Association for Computational Linguistics.

Ella Rabinovich and Shuly Wintner. 2015. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419–432.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, et al. 2022. Scaling language models: Methods, analysis & insights from training gopher.

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 indic languages. Transactions of the Association for Computational Linguistics, 10:145–162.

Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan van Stigt, Craig Stewart, Pedro Ramos, Taisiya Glushkova, André F. T. Martins, and Alon Lavie. 2021. Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR, abs/2206.04615.

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. Roformer: Enhanced transformer with rotary position embedding.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018. The fact extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9, Brussels, Belgium. Association for Computational Linguistics.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. 2018. Attaining the unattainable? reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 113–123, Brussels, Belgium. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, and Ryan Cotterell. 2023. Findings of the BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34, Singapore. Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 27263–27277.

Mike Zhang and Antonio Toral. 2019. The effect of translationese in machine translation test sets. CoRR, abs/1906.08069.

A Additional results

We report additional results in this section. Tables 7 and 8 show the chrF++ and BLEU scores across three translation evaluation benchmarks. They show that using parallel synthetic data does not deteriorate the performance of the language model. Similar results are shown in Table 9 for IndicNLG tasks, where performance on Hindi generation tasks is affected only by a small margin; coupled with the results in Table 2, this shows that scores are not affected by using Hindi synthetic parallel data.

B Training and Evaluation

B.1 Training

In this section, we list the datasets and hyperparameters used for training our models for the experiments. For the pretraining of the base models, we keep a hard limit for the base-1k model of 2.38B tokens and for the mini-1k model of 1B tokens. For the TinyLMs we relax this token limit until we see overfitting. For our experiments, we use NVIDIA A100-SXM4-80GB GPUs.

B.2 Extended pretraining

For the mini-1k models, we randomly sample 100M tokens from the clean subset of IndicMonoDoc for the target language, and for the base-1k model, we sample 200M tokens for extended pretraining. We use the same hyperparameters as in training and perform extended pretraining for 2 epochs over this newly sampled clean data.

B.3 Fine-tuning

For GLUE tasks we use the dev split on the clean part and do hyperparameter tuning to achieve the best scores, and then we use the same hyperparameters for all synthetic experiments. For IndicGLUE we follow a similar setting on the val split to find good hyperparameters and report results on the test split, like Kakwani et al. (2020). For all classification and regression tasks, we use a single linear layer with an appropriate activation function for classification and regression respectively. We use an Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e−5 and a batch size of 48. For NLG tasks we do extended pretraining using a separator token between the input and output sequences, with an effective batch size of 768 examples, and only calculate the loss for the output sequence. We use an AdamW optimizer (Loshchilov and Hutter, 2019) with learning rate = 6e−4, weight decay = 1e−1, β1 = 0.9, β2 = 0.95 and ε = 1e−5. For translation, we randomly sample 1M parallel sentences for each language pair from the Samanantar corpus (Ramesh et al., 2022) and evaluate on FLoRes (Team et al., 2022), IN22-Conv and IN22-Gen (Gala et al., 2023). We list the batch size and number of epochs for each task in Table 12.
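The separator-token loss masking described above can be sketched as follows. The separator id and the model's output interface are assumptions on our part, not the authors' code:

```python
import torch
import torch.nn.functional as F

SEP_ID = 42  # hypothetical id of the separator token

def nlg_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """CLM loss on sequences laid out as [input tokens, SEP, output tokens],
    counting loss only on the output side."""
    labels = input_ids.clone()
    for row in range(labels.size(0)):
        sep = (input_ids[row] == SEP_ID).nonzero()[0].item()
        labels[row, : sep + 1] = -100          # ignore input + separator
    logits = model(input_ids)                  # assumed to return (B, T, V)
    return F.cross_entropy(                    # shift: predict token t+1 from prefix
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

# AdamW settings reported above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
#                               weight_decay=1e-1, betas=(0.9, 0.95), eps=1e-5)
```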
Model                             | IN22-Conv       | IN22-Gen        | FLORES
                                  | EN-HI  HI-EN    | EN-HI  HI-EN    | EN-HI  HI-EN
BI-EN-HI-clean                    | 41.22  50.30    | 43.49  47.83    | 46.56  51.70
BI-EN-HI_syn-parallel-filtered    | 41.92  49.67    | 41.61  46.95    | 44.12  50.64
BI-EN-HI_syn-nonparallel-filtered | 40.74  49.54    | 42.28  47.66    | 45.65  51.29
                                  | EN-GU  GU-EN    | EN-GU  GU-EN    | EN-GU  GU-EN
BI-EN-GU-clean                    | 35.85  41.27    | 22.95  31.83    | 26.44  35.30
BI-EN-GU_syn-parallel-filtered    | 34.36  41.86    | 22.93  30.84    | 26.77  34.84
BI-EN-GU_syn-nonparallel-filtered | 34.49  42.08    | 23.06  32.81    | 26.70  36.54

Table 7: chrF++ scores on the FLoRes, IN22-Conv and IN22-Gen splits for the translation task. EN-HI models are based on base-1k and EN-GU models are based on mini-1k. Bold values represent the best amongst synthetic splits.

Table 8: BLEU scores on the FLoRes, IN22-Conv and IN22-Gen splits for the translation task. EN-HI models are based on base-1k and EN-GU models are based on mini-1k. Bold values represent the best amongst synthetic splits.

Table 9: Performance of bilingual models on IndicNLG tasks. All the results reported here are on base-1k and use Rouge-L F1 scores. Bold values represent the best amongst synthetic splits.
Hyperparameter           | Value
vocab_size               | 56000
val_every                | 0.05
bs                       | 48
n_embed                  | 768
num_blocks               | 4
num_heads                | 16
head_size                | n_embed // num_heads
context_len              | 1024
block_size               | context_len
attn_drop_value          | 0.1
dropout                  | 0.1
ffn_drop_value           | 0.1
use_flashattn            | TRUE
ffn_scaling              | 4
positional_embedding     | rope
rotatory_embedding_dim   | head_size // 2
lr                       | 6.00E-04
wd                       | 1.00E-01
beta_1                   | 0.9
beta_2                   | 0.95
eps                      | 1.00E-05
epochs                   | 2
precision                | bf16
accumulate_grad_batches  | 8
gradient_clip_val        | 1
strategy                 | ddp
accelerator              | gpu
warmup_steps             | 5000
num_workers              | 16
SHUFFLE_SEED             | 42
PIN_MEMORY               | TRUE
NUM_NODES                | 1
NUM_DEVICES              | 2

Table 10: Hyperparameters used for training the mini-1k model.

Task                   | Batch size | Epochs | Metric
IndicXNLI              |         48 |      5 | Accuracy
BBC-Articles           |         24 |     20 | Accuracy
IITP-MR                |         24 |     20 | Accuracy
IITP-PR                |         48 |     20 | Accuracy
MIDAS                  |         48 |     20 | Accuracy
Headline Generation    |        768 |      2 | Rouge-L F1
Sentence Summarization |        768 |      2 | Rouge-L F1
Question Generation    |        768 |      2 | Rouge-L F1
WikiBio Generation     |        768 |      4 | Rouge-L F1
iNLTK                  |         48 |     20 | Accuracy
sst2                   |         48 |     10 | Accuracy
CoLA                   |         48 |     30 | MCC
mrpc                   |         48 |     30 | F1
qnli                   |         48 |     10 | Accuracy
qqp                    |         48 |      5 | F1
rte                    |         48 |     30 | Accuracy
mnli-matched           |         48 |      5 | Accuracy
mnli-mismatched        |         48 |      5 | Accuracy
stsb                   |         48 |     20 | Pearson
XLSum Headline Gen.    |        768 |      4 | Rouge-L F1
XLSum Question Gen.    |        768 |      4 | Rouge-L F1
CNN Dailymail          |        768 |      4 | Rouge-L F1
DialogSum              |        768 |      4 | Rouge-L F1
Samanantar             |        768 |      2 | chrF++ / BLEU

Table 12: Hyperparameters used for the finetuning tasks.
We report English NLU scores on the validation split of the GLUE benchmark, and test splits for the XL-Sum, CNN Dailymail, and DialogSum NLG benchmarks. For Hindi and Gujarati, we use the test splits of IndicGLUE and IndicXNLI.

For classification and regression tasks, we use the models finetuned with the hyperparameters mentioned in Appendix B.3 to keep the comparison fair across all models, and we report results from the final epoch. For generation on IndicNLG and English NLG tasks, we use beam search with a beam width of 5, a length penalty of 1.0, an n-gram repetition penalty that blocks repeated 4-grams, sampling set to false, and early stopping set to true. We also set the maximum generation length to 64 tokens. For the translation task, we use beam search with a beam width of 5, a maximum of 256 new tokens, and early stopping set to true. A sketch of these decoding settings is shown below.
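The decoding configuration above maps naturally onto the Hugging Face transformers generate() API; the sketch below is an illustration with a placeholder model path, not the authors' released code:

    # Sketch of the decoding settings described above (beam search, no sampling).
    # "path/to/finetuned-lm" is a placeholder, not a released checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("path/to/finetuned-lm")
    model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-lm")

    inputs = tok("Input text for generation:", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=5,              # beam width of 5
            length_penalty=1.0,
            no_repeat_ngram_size=4,   # block repeated 4-grams
            do_sample=False,          # sampling set to false
            early_stopping=True,
            max_new_tokens=64,        # 64 for NLG tasks; 256 for translation
        )
    print(tok.decode(out[0], skip_special_tokens=True))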
12 [Link]
13 chrF++ signature: nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.4.0
14 sacreBLEU signature: nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.0
15 [Link]
16 [Link]
C Perplexity filtering

C.1 Creating synthetic data

"Translationese" is a term used to describe peculiarities of text translated into a specific language that differentiate it from content originally written in that language (Gellerstam, 1986). Texts translated into the target language (by humans or machines) often show distinctive features that differentiate them from their original counterparts in the target language. These disparities arise either from the influence of the translation process itself on the final product or from the inherent "fingerprints" of the source language subtly present in the target-language rendition (Rabinovich and Wintner, 2015). This is a common phenomenon in translation models, where target-language translations often show characteristics of the source language and add bias to the evaluation of downstream tasks (Toral et al., 2018; Zhang and Toral, 2019; Graham et al., 2019). So far, most work on synthetic translated data has focused on back-translation (Sennrich et al., 2016a; Edunov et al., 2018) for improving machine translation performance (Marie et al., 2020; Bogoychev and Sennrich, 2019; Ni et al., 2022) or for classification tasks such as native language identification (Goldin et al., 2018). While translationese data has been used for many tasks, we explore its efficacy for pretraining language models. We collect monolingual corpora in the source language as mentioned in Section 4 and utilize a powerful off-the-shelf translation model, IndicTrans2 (Gala et al., 2023), to generate translationese data. We use the 1B en-indic version17 of IndicTrans2 with beam search and a beam value of 5 to translate 5 billion English tokens from IndicMonoDoc into Hindi and Gujarati. Since IndicTrans2 can only handle a maximum sentence length of 256 BPE tokens, we split the documents using the Moses Sentence Splitter18 to perform translations into the target language at the sentence level and then merge them again to form documents (see the sketch below). We also repair translations that exceed 256 BPE tokens in length, using the TinyLM trained on clean corpora as mentioned in Section 5 to complete the sentence translation; we encounter only 0.002% of such cases. We also do the reverse, using the 1B indic-en version19 of IndicTrans2 to translate 5B Hindi tokens and 900M Gujarati tokens from IndicMonoDoc into English. Together, these make up the unfiltered synthetic translationese data in English, Hindi, and Gujarati. We use this corpus for the synthetic and clean+synthetic parts of our experiments.
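The split-translate-merge step described above can be sketched as follows; split_sentences and translate_batch are hypothetical stand-ins for the Moses sentence splitter and an IndicTrans2-style MT model, since the exact interfaces are not part of this paper:

    # Sketch of document translation under a 256-BPE-token sentence limit:
    # split into sentences, translate each, and merge back into a document.
    # split_sentences and translate_batch are hypothetical stand-ins.
    from typing import Callable, List

    def translate_document(
        doc: str,
        split_sentences: Callable[[str], List[str]],
        translate_batch: Callable[[List[str]], List[str]],
    ) -> str:
        sentences = split_sentences(doc)           # sentence-level segmentation
        translations = translate_batch(sentences)  # e.g. beam search with beam 5
        return " ".join(translations)              # merge back into one document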
C.2 Perplexity filtering

Following Figure 2, we use these TinyLMs to filter the synthetic translationese corpora generated with IndicTrans2. We do this by using perplexity as a measure of document quality. For language models, perplexity quantifies how well a model predicts a sequence of tokens; a lower perplexity indicates better predictive performance. It is calculated as:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_{\theta}(w_i \mid w_{<i})\right)$$
where the negative log-likelihood measures the error of the model's predictions. While calculating perplexity over a sequence of tokens W = w_1, w_2, ..., w_N, we skip the first s tokens and calculate the loss only over the first e tokens of the document, with s = 10 and e = 1024. We find that setting e to larger values can lead to higher variance in the document scores due to the small size of the TinyLM. After initial analysis, we choose s and e such that we remove the model's high uncertainty at the start of an unseen document and avoid penalizing longer documents due to the fragility of the TinyLM's extrapolation ability20. Note that it is important to choose e such that the language model gives a uniform estimate of perplexity over an already seen sequence of tokens w_s, w_{s+1}, ..., w_e. For our experiments, we use the TinyLMs to score all synthetically generated translationese data and calculate a document score using the above method. Following Laurençon et al. (2022), we perform subsampling by thresholding document perplexity scores; Laurençon et al. (2022) did this using KenLM (Heafield, 2011), whereas we use our TinyLM. We keep the threshold value such that we include enough documents to reach the computed optimal token count for the pretraining experiments.
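A minimal sketch of this scoring procedure, assuming the TinyLM can be loaded as a causal LM through the transformers API (the model path and threshold are placeholders):

    # Sketch of perplexity scoring with skip/limit windowing (s = 10, e = 1024).
    # "path/to/tinylm" and THRESHOLD are placeholders.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    S, E = 10, 1024  # skip the first S tokens; score at most the first E tokens

    def document_perplexity(doc: str, model, tok) -> float:
        ids = tok(doc, return_tensors="pt").input_ids[:, :E]   # truncate at E tokens
        with torch.no_grad():
            logits = model(ids).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predict token i from tokens < i
        token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        token_lp = token_lp[:, S:]                             # drop the first S positions
        return math.exp(-token_lp.mean().item())               # PPL = exp(mean NLL)

    # Subsampling: keep documents whose perplexity is below a chosen threshold.
    # kept = [d for d in docs if document_perplexity(d, model, tok) < THRESHOLD]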
17 [Link] indictrans2-en-indic-1B
18 [Link]
19 [Link] indictrans2-indic-en-1B
20 During experiments we saw that these TinyLMs can only go up to a certain context length before deteriorating in quality.
Language   IndicCorpv2   Ours
bn         926.00        5258.47
en         6501.00       11986.53
gu         901.00        887.18
hi         6107.00       11268.33
kn         875.00        567.16
ml         931.00        845.32
mr         795.00        1066.76
ne         852.00        1542.39
pa         732.00        449.61
ta         476.00        2171.92
te         731.00        767.18
ur         667.00        2391.79
as         67.00         57.64
brx        2.50          2.25
doi        0.10          0.37
gom        31.90         2.91
kas        0.06          1.27
mai        13.70         1.51
mni        0.60          0.99
or         122.00        81.96
sa         125.00        80.09
sat        4.00          3.05
sd         13.20         83.81

Table 13: Language-wise corpora size comparison in million tokens.
D IndicMonoDoc

In this section, we describe the process of creating the IndicMonoDoc corpus, the largest document-level corpus for Indic languages, consisting of 39.5 billion tokens spanning 23 languages. IndicMonoDoc comprises 27.5B Indic tokens and 12B English tokens. Table 13 shows the language-wise deduplicated size of the IndicMonoDoc corpus, and Figure 3 shows a comparative 100% stacked bar plot against IndicCorpv2, which is a sentence-level corpus.
D.1 Crawling

To extract URLs from the web, we sample word-level n-grams (n = 2, ..., 6) from a sample monolingual corpus to create a list of search keywords. We then randomly merge k (k = 1, ..., 4) keywords to form a query; a sketch of this procedure is shown below. Using these queries, we perform automatic web searches to collect a large repository of URLs. We merge this list with a manual list of sources and perform URL-level deduplication. We crawl these webpages, leaving out some of them21. We leave out webpages that consist of a considerable amount of English content, detected using a simple script-recognition regex. We perform this scraping mainly for the bottom 14 low-resource languages. We also add script-level recognition using Unicode characters22 for each language before crawling a webpage, to avoid scraping non-Indic text.
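A sketch of this query-construction heuristic follows; all names are illustrative, and the web-search step itself is outside the scope of the sketch:

    # Sketch of search-query construction: sample word-level n-grams (n = 2..6)
    # from a seed monolingual corpus and join k (1..4) of them into a query.
    import random

    def sample_ngrams(corpus_lines, num_samples=1000):
        ngrams = []
        for _ in range(num_samples):
            words = random.choice(corpus_lines).split()
            n = random.randint(2, 6)
            if len(words) < n:
                continue
            start = random.randint(0, len(words) - n)
            ngrams.append(" ".join(words[start:start + n]))
        return ngrams

    def make_queries(ngrams, num_queries=100):
        return [" ".join(random.sample(ngrams, random.randint(1, 4)))
                for _ in range(num_queries)]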
D.2 Post processing

A lot of crawled content consists of unwanted text such as HTML tags, emoticons, and text in other languages. We use manual filtering pipelines inspired by OSCAR (Ortiz Suárez et al., 2019; Abadji et al., 2022) to remove such content. We additionally apply language detection-based (LID) filtering using cld323 and the IndicLID-FTN model (Madhani et al., 2023a) to discard languages not of interest. Following Doddapaneni et al. (2023), we perform document filtering to remove offensive text from the corpora using a list of offensive words and phrases extended from the work of Team et al. (2022), which covers offensive words in 209 languages. We also use a Romanized version of this list, produced with the transliteration tool of Madhani et al. (2023b), to perform toxic document filtering in 17 languages. Following Kakwani et al. (2020) and Doddapaneni et al. (2023), we merge the filtered corpus with Wikipedia, OSCAR (Ortiz Suárez et al., 2019), and some dumps of mC4 (Xue et al., 2021). Finally, we perform deduplication at the paragraph level using the MurmurHash algorithm24 with a 128-bit unsigned hash for each monolingual split of the corpora (see the sketch below). After all post-processing steps, the language-wise sizes of the corpora are given in Table 13. A major chunk of the corpus consists of English, Hindi, and Bengali, which together make up 72.15% of the corpora.

21 We leave out webpages consisting of a [Link] file and URLs containing offensive text or social media links.
22 [Link]
23 [Link]
24 [Link]
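For illustration, the paragraph-level deduplication step can be sketched as follows, using the mmh3 package as one possible implementation of a 128-bit unsigned MurmurHash:

    # Sketch of paragraph-level deduplication keyed on a 128-bit unsigned hash.
    import mmh3

    def dedup_paragraphs(paragraphs):
        """Keep the first occurrence of each paragraph, keyed by its hash."""
        seen, unique = set(), []
        for p in paragraphs:
            h = mmh3.hash128(p, signed=False)  # 128-bit unsigned MurmurHash
            if h not in seen:
                seen.add(h)
                unique.append(p)
        return unique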