0% found this document useful (0 votes)

37 views18 pages

Do Not Worry If You Do Not Have Data

This paper investigates the use of synthetic data, specifically Translationese, for pre-training language models in low-resource languages like Hindi and Gujarati. It demonstrates that models trained on filtered synthetic data can achieve performance close to those trained on clean data, particularly when combined with a small amount of clean data for fine-tuning. The authors introduce IndicMonoDoc, a large collection of monolingual document-level corpora, to aid in bridging the performance gap between English and non-English language models.

Uploaded by

720matheusmendes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views18 pages

Do Not Worry If You Do Not Have Data

Uploaded by

720matheusmendes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Do Not Worry if You Do Not Have Data:

Building Pretrained Language Models Using Translationese

Table 1

Meet Doshi* , Raj Dabre** , and Pushpak Bhattacharyya*

Language English Hindi Gujarati

Clean 76.87 77.60 79.95

Synthetic 71.84 74.26 77.65
*
CFILT, Indian Institute of Technology Bombay
Un ltered

** 73.78 75.67
Synthetic 78.47
NICT, Japan Un ltered + 10%

* 74.60 76.63
Synthetic 79.55
{meetdoshi,pb}@[Link] Filtered

** 75.83 77.52
Synthetic 80.23
{prajdabre}@[Link] + 10%

Abstract English Hindi

Performance comparison of Translationese text on NLU tasks
Gujarati

81
In this paper, we explore the utility of Transla-
tionese as synthetic data created using machine
78.5
arXiv:2403.13638v2 [[Link]] 21 Mar 2024

translation for pre-training language models

Avg. across tasks

(LMs). Pre-training requires vast amounts of
monolingual data, which is mostly unavailable 76

for languages other than English. Recently,

there has been a growing interest in using syn- 73.5

thetic data to address this data scarcity. We take

the case of English and Indic languages and
71
translate web-crawled monolingual documents Clean Synthetic
Un ltered
Synthetic
Un ltered + 10%
Synthetic
Filtered
Synthetic
Filtered + 10%
(clean) into the target language. Then, we train Training Data

language models containing 28M and 85M pa-

rameters on this translationese data (synthetic). Figure 1: Comparison of NLU performance between
We show that their performance on downstream language models trained on clean data and synthetic
natural language understanding and generative translationese data in English, Hindi, and Gujarati shows
tasks is only 3.56% poorer on NLU tasks and that filtering synthetic text and extending training on a
1.51% on NLG tasks than LMs pre-trained on clean subset (10%) can bring the performance of models
clean data. Further, we propose the use of trained on synthetic data closer to LMs trained only on
lightweight TinyLMs pre-trained on clean data clean data extracted from the web.
to filter synthetic data efficiently which signifi-
cantly improves the performance of our models.
We also find that LMs trained on synthetic data able (Kudugunta et al., 2023), but a vast majority
strongly benefit from extended pretraining on of languages don’t have comparable data as com-
a tiny fraction (10%) of clean data. We release pared to English. As a consequence, many LLMs,
the data we collected and created as a part of
both monolingual and multilingual, involving these
this work, IndicMonoDoc, the largest collection
of monolingual document-level corpora, which languages still show poor performance for vari-
we hope will help bridge the gap between En- ous downstream tasks. For example, the largest
glish and non-English performance for large open source multilingual model BLOOM (Work-
language models. shop et al., 2022) covers 46 natural 1languages span-
fifi ning 9 language families, but the top 5 languages
1 Introduction comprise 74.14% of the data. Despite the benefits
Large language models (LLMs)(Brown et al., 2020; of multilingualism (Dabre et al., 2020), this skew
Workshop et al., 2022; Almazrouei et al., 2023; in data still means that the low-resource languages
Lin et al., 2022) have been able to perform very will not perform well.
well on downstream tasks like MMLU (Hendrycks Fortunately synthetic data is an option and pre-
et al., 2021), Big-Bench (Srivastava et al., 2022), vious works such as, but not limited to, back-
etc, and have even started to reach human potential translation (Sennrich et al., 2016a), sequence dis-
in many of these tasks. But this performance has tillation (Kim and Rush, 2016), also known as
very largely been credited to their scale and the forward translation, etc. have shown that syn-
vast amount of data that they have been fed. Most thetic data obtained using machine translation (MT)
of these language models (LMs) perform well in can supplement resource scarcity and can signifi-
languages like English where abundant data is avail- cantly enhance model performance (Popović et al.,
2020; Gala et al., 2023). However, to the best of 2 Related Work
our knowledge, there has been no work on show-
ing the effectiveness of synthetic data for train- This paper focuses on creating, filtering, and utiliz-
ing LMs. Furthermore, the quality of synthetic ing synthetic data for training TinyLMs.
data is also important, which many works take for Monolingual Data: Previous efforts of collect-
granted. While there are options such as round- ing monolingual corpora for Indic languages in-
trip-translation (Moon et al., 2020) or referenceless clude but not limited to the EMILLE/CIIL cor-
neural quality estimation (QE) (Rei et al., 2021), pus (McEnery et al., 2000), HindMonoCorp (Bo-
they either involve twice the compute or a reason- jar et al., 2014), Leipzig corpus (Goldhahn et al.,
ably large model not available for most languages, 2012), IndicCorpv1 (Kakwani et al., 2020) and
and this might not be optimal to determine the qual- more recently IndicCorpv2 (Doddapaneni et al.,
ity of synthetic documents efficiently. We thus 2023) which shows significant increase in size com-
consider TinyLMs (Eldan and Li, 2023) as an effi- pared to previous Indic corpora. Although the size
cient alternative, which have been shown to model of IndicCorpv2 is large, it is a sentence-level cor-
documents by their fluent paragraph generation ca- pora that can be used to train NLU models but
pabilities. language models with their increase in size can
In this paper, we focus on English and Indic be trained on longer contexts that require long se-
languages and present a comprehensive study of quences of documents. In this work, we build upon
the utility of synthetic, translationese (Gellerstam, existing monolingual corpora for Indic languages
1986), monolingual data obtained using machine and also show the efficacy of using synthetic data.
translation (MT) for training LMs. We propose a Synthetic Data Generation and Quality Estima-
simple framework that involves training small lan- tion: Synthetic data is valuable for NLP tasks like
guage models, TinyLMs, on original web-crawled using back translations (Sennrich et al., 2016a),
data (clean) and then using them to filter synthetic (Edunov et al., 2018) to enhance Machine transla-
data. We then compare LMs trained on clean and tion (Marie et al., 2020), (Bogoychev and Sennrich,
synthetic data followed by fine-tuning on down- 2019), (Ni et al., 2022), or for tasks such as na-
stream tasks, where we observe that, unfiltered syn- tive language identification (Goldin et al., 2018).
thetic data LMs perform slightly poor compared However, there’s a limited exploration of using
to LMs trained on clean data, but after filtering, it synthetic data for pretraining LMs due to issues
matches its performance. We also show that tuning like hallucination (Maynez et al., 2020), and un-
these synthetic data LMs on small clean data leads grounded non-factual text (Thorne et al., 2018).
to further improvements. Machine translation can help mitigate these issues,
but translation errors may still occur, impacting
Our contributions are:
text quality. Evaluating synthetic text quality with
a. A simple framework involving high-quality MT
round-trip-translation (RTT) BLEU scores requires
models and TinyLMs trained on clean web-crawled
twice the compute and is prone to errors. Alter-
data to mass-produce and filter synthetic data for
native evaluation methods like BARTScore (Yuan
LM training.
et al., 2021), T5Score (Qin et al., 2023), MQM
b. Demonstrating the efficacy of language models & COMET (Rei et al., 2020) require large-scale
trained on filtered synthetic data across a range of models or human annotations, limiting scalabil-
downstream tasks in natural language understand- ity. Approaches like KenLM (Heafield, 2011) have
ing and generation. been used to filter monolingual corpora based on
c. A new document-level monolingual corpora (In- perplexity. We take inspiration from this and show
dicMonoDoc) consisting of 39.5B tokens worth of that efficient TinyLM with just 28M parameters
monolingual clean document-level data spanning can be used to filter synthetic text.
22 scheduled languages and English. This dataset, TinyLMs: Language models as small as 10M pa-
which is 2 times larger than IndicCorpV2 (Dodda- rameters have been shown to produce fluent and
paneni et al., 2023), will be made public alongside consistent stories with almost perfect grammar (El-
our models and code12 . dan and Li, 2023) which means LMs even at a small
1 scale have language understanding. Challenges like
[Link]
2
[Link] BabyLM (Warstadt et al., 2023) focus on improv-
IITB-MonoDoc ing LMs with a fixed data budget which enables
Figure 2: Overview of our approach to pre-train language models using translationese data. We leverage rich
monolingual corpora in the src language and scarce corpora in the tgt language. Our method involves employing a
pre-trained machine translation model to translate src to tgt. We then filter the resulting text using a TinyLM trained
solely on clean tgt monolingual data for perplexity. The filtered synthetic data can be used to further pretrain larger
language models.

exhaustive study of LM development methodolo-

gies, which can then be applied to larger LMs. We
aim to utilize lightweight TinyLMs for efficient
filtering of synthetic documents.

3 Methodology
In this section, we describe our framework for lever-
aging synthetic data for LM training. This con-
sists of monolingual data curation from the web
(clean), training a TinyLM with it, translation of
clean data, using the aforementioned TinyLM to
filter synthetic data, and then using this filtered data
for training a larger LM to be used for downstream
tasks. Our framework is described in Figure 2.
Figure 3: Language-wise corpora size comparison with
3.1 Monolingual Data
IndicCorpv2 (Doddapaneni et al., 2023): Stacked Bars
Web Crawled (Clean): Following Doddapaneni
et al. (2023); Rae et al. (2022); Team et al. (2022),
for all languages of interest, we a. obtain a list of the filtered corpus with Wikipedia, OSCAR (Ortiz
URLs to be crawled via word-level n-grams passed Suárez et al., 2019) and some dumps of mC4 (Xue
to a search engine, b. after URL deduplication, we et al., 2021) and finally, g. perform deduplication
crawl all applicable webpages, c. automatically at paragraph level using Murmurhash algorithm4
and manually (Ortiz Suárez et al., 2019; Abadji with a 128-bit unsigned hash for each monolingual
et al., 2022) filter out unwanted text like HTML split of the corpora.
tags and emoticons, d. use language detection- Translationese (Synthetic): We utilize state-of-
based (LID) filtering using cld33 and IndicLID- the-art MT models like IndicTrans2 (Gala et al.,
FTN model (Madhani et al., 2023a) to discard lan- 2023) to generate translationese data. We use beam
guages not of interest, e. perform document fil- search with a beam value of 5 to translate English
tering to remove offensive text using toxic words tokens from the aforementioned crawled corpus to
list provided by Team et al. (2022), f. merge all the languages of interest. Most MT models have a
3 4
[Link] [Link]
Figure 4: The plot illustrates TinyLM’s perplexity mean and variance across various datasets: Clean-EN (left),
Syn-EN from filtered Hindi (middle), and Syn-EN from filtered Gujarati (right). Despite filtering, English documents
generated from translating Gujarati show consistently higher variance.

maximum token limit, and thus we split the docu- fine-tune them for natural language understand-
ments using Moses Sentence Splitter5 to perform ing (NLU) tasks such as IndicGLUE (Kakwani
translations into the target language at the sentence et al., 2020), GLUE (Wang et al., 2018) and genera-
level and then merge again to form documents. Our tion (NLG) benchmarks such as IndicNLG (Kumar
experiments also focus on synthetic English data et al., 2022), summarization tasks (Nallapati et al.,
translated from Hindi and Gujarati. 2016), (Chen et al., 2021) and machine translation
benchmarks (Team et al., 2022), (Gala et al., 2023).
3.2 Tiny Language Models (TinyLMs)
TinyLMs are simply tiny versions of language mod- 4 IndicMonoDoc
els inspired by Eldan and Li (2023). We follow Following the monolingual data curation strategy
the Transformer architecture (Vaswani et al., 2017) in Section 3, we crawl data for English and 22 Indic
used by Eldan and Li (2023) and train it using only languages. As a result, we end up with IndicMon-
clean monolingual documents. Instead of using oDoc, with 27.5 billion tokens of Indic documents
learned positional encodings, we use RoPE em- and 12 billion tokens of English documents for a
beddings (Su et al., 2023) for better extrapolation total of 39.5 billion tokens of clean monolingual
to longer documents. We rely on Chinchilla scal- data. This is the largest ever for Indic languages,
ing laws Hoffmann et al. (2022) and use compute even surpassing (Doddapaneni et al., 2023) by 2
optimal word tokens to train our models. times. We use IndicMonoDoc for all clean parts
3.3 Synthetic Data Filtering of our experiments. We report additional details of
IndicMonoDoc in Appendix D.
We first train TinyLMs on crawled data and then
use them to compute perplexities on our synthetic 4.1 Analysis of Crawled Corpora
documents (W ∈ w1 , w2 , . . . , wN ) using the equa- Figure 3 gives an overview of the comparison of
tion: IndicMonoDoc which is a document-level corpus
( N
) with that of IndicCorpV2 which is a sentence-level
1 X
PPL(W ) = exp − log pθ (wi | w<i ) corpus. It is important to note that we paid special
N attention to the low-resource languages.
i

We also use our TinyLM to repair sentences in doc- 4.2 Analysis of Synthetic Data
uments that exceed the maximum length of the MT We use IndicMonoDoc for the clean part of our
models, but this happens in only 0.002% of such experiments and translate parts of it for the syn-
cases. We provide more details of our approach in thetic experiments. Figure 4 shows the perplexity
Appendix C. mean and variance scores for TinyLM across token
3.4 Final LM and Downstream Task Training positions in the documents. This shows that on
unseen documents, TinyLM shows higher variance
Using the filtered synthetic corpora, we train our
on English documents generated by translating Gu-
final LMs which are comparatively larger, and
jarati documents from IndicMonoDoc as compared
5
[Link] to English clean and English synthetic generated
syn-XX_yy-unfiltered: Denotes synthetic mono-
lingual documents in XX language generated by
using yy as source during translation.
syn-XX_yy-filtered: Filtered synthetic data.
+10%: Refers to extended pretraining on a cleaned
subset of IndicMonoDoc with an additional 10%
tokens compared to regular training.
BI-XX-YY Prefix: Denotes bilingual models
trained using an equal mixture of monolingual cor-
pora in XX and YY languages. We append an
_syn prefix to either XX or YY if a synthetic ver-
sion of that language is employed in training and a
Figure 5: Violin plot displaying the distribution of
lengths of clean and filtered English documents on dif- -parallel/nonparallel tag to denote whether a paral-
ferent data splits: en-clean (English web documents), lel version of XX and YY are used or not.
syn-en_hi (synthetic English documents translated from Note, for each split we only use the number of
Hindi), and syn-en_gu (synthetic English documents tokens that are required to reach the point of op-
translated from Gujarati). timality (Hoffmann et al., 2022) by the language
model. We mention other details in Appendix B.
from Hindi. This also gives reason for the deteriora-
5.2 Implementation and Training
tion in results in Table 4 due to Gujarati documents.
Figure 5 shows the distribution of lengths of fil- Tokenizer: We use a common byte-pair-encoding
tered documents by TinyLMs showing that they (BPE) (Sennrich et al., 2016b) tokenizer using Sen-
do not add any bias for shorter documents during tencepiece6 for all experiments. We train a shared
filtering. Although we do not experiment with all vocabulary of 56k subwords between three lan-
these languages, we believe that IndicMonoDoc guages, English, Hindi, and Gujarati by using 5
will be an invaluable resource for Indic LMs. Million randomly sampled sentences per language
and upsampling for Gujarati.
5 Experiments TinyLMs: We use Pytorch Lightning7 for our im-
plementations and train TinyLMs as described in
In this section, we describe the training procedure Section 3.2. We use hidden sizes of 768 and have
and datasets used for different models mentioned two variants, one with 4 layers (mini) and one with
in Section 3. We pretrain and fine-tune all of the 12 layers (base; same as GPT2-base) with 28M
mentioned models from scratch in both mono and and 85M non-embedding parameters respectively.
bilingual settings using a Causal Language Model- The mini models are trained on clean data with
ing (CLM) objective on NLG tasks and use a Lin- sequence lengths of 40968 (mini-4k) for filtering
ear classification head for all classification tasks. synthetic documents as described in Section 3.3.
We specify the sample of the dataset used for pre- On the other hand, for our main pre-training and
training and finetuning for each model and see the downstream fine-tuning experiments, we train mini
different effects of using synthetic corpora for pre- and base models with sequence lengths of 1024
training. (mini-1k and base-1k). Following Hoffmann et al.
(2022) we use 2.4 billion word tokens per language
5.1 Pretraining Data Settings
to compute optimal training of base models. Since
We refer to translated text or translationese as syn- Gujarati has only 900M tokens in our dataset, when-
thetic or syn and original or web-crawled data as ever Gujarati is involved, we train only the mini-1k
clean throughout our experiments. For the pre- model. For models involving English and Hindi,
training of all base models, we use the following we train both mini and base models. Additional
naming convention to denote our training splits for details are in Appendix B.
each model:
6
XX-clean: This is a clean subset sampled randomly [Link]
7
[Link]
from IndicMonoDoc where XX represents the lan- 8
We keep long sequence lengths to be able to handle long
guage English (EN), Hindi (HI) or Gujarati (GU). documents for filtering.
(a) Results on Hindi
NLU NLG
Model Headline Sentence Question
iXNLI bbc-a iitp-mr iitp-pr midas Avg. Wikibio Avg.
Gen. Summ. Gen.
HI-clean 73.61 81.75 72.58 79.73 80.34 77.60 27.54 23.64 24.84 52.16 32.04
syn-HI_en-unfiltered 72.87 77.92 64.36 76.22 79.91 74.26 27.29 22.93 24.22 50.14 31.14
syn-HI_en-unfiltered+10% 74.63 78.36 67.75 77.46 80.17 75.67 - - - - -
syn-HI_en-filtered 74.75 81.06 69.03 78.58 79.73 76.63 27.15 23.10 24.41 49.88 31.13
syn-HI_en-filtered+10% 74.49 80.94 71.61 79.92 80.64 77.52 - - - - -

(b) Results on Gujarati

NLU NLG
Model Headline Sentence Question
iXNLI iNLTK Avg. Avg.
Gen. Summ. Gen.
GU-clean 67.8 92.1 79.95 17.62 13.82 15.18 15.54
syn-GU_en-unfiltered 65.51 89.78 77.65 16.21 13.29 13.66 14.39
syn-GU_en-unfiltered+10% 66.83 90.11 78.47 - - - -
syn-GU_en-filtered 67.74 91.35 79.55 17.64 13.40 14.95 15.33
syn-GU_en-filtered+10% 68.04 92.41 80.23 - - - -

Table 1: Results for Hindi and Gujarati: NLU/NLG tasks on base-1k (Hindi) and mini-1k (Gujarati) models on
different clean and synthetic splits. Test accuracy for NLU tasks; Rouge-L F1 scores for NLG tasks. Bold values
represent the best amongst synthetic splits.

5.3 Downstream Tasks and Evaluation translations. We demonstrate the impact of filter-
We finetune the mini-1k and base-1k models for ing and adding additional clean data for extended
various classification, regression, and generation pretraining of LMs trained solely on synthetic text.
tasks. We do some hyperparameter tuning for each Additionally, we observe the effect of using the
task and then repeat them for different data splits. clean source text along with its translations (syn-
More hyperparameters and evaluation details can thetic parallel documents) on downstream tasks.
be found in Appendix B. For evaluations, we re- We follow the naming convention for different data
port our primary scores on IndicGLUE (Kakwani splits as specified in Section 5. We provide details
et al., 2020) and IndicXNLI (iXNLI) (Aggarwal for pretraining of each model in Appendix B.
et al., 2022) for Hindi and Gujarati and use the Filtered Synthetic Data is Competitive with
validation set of GLUE benchmark (Wang et al., Web Scraped Data: The results in Table 1 and
2018) for English. We also experiment with other 2 indicate that syn-HI_en-unfiltered, syn-GU_en-
generation tasks like CNN-Dailymail (Nallapati unfiltered, and syn-EN_hi-unfiltered exhibit lower
et al., 2016), DailogSum (Chen et al., 2021), XL- downstream performance compared to their fil-
Sum (Hasan et al., 2021), IndicNLG9 (Kumar et al., tered counterparts: syn-HI_en-filtered, syn-GU_en-
2022), FLoRes-200 (Team et al., 2022), IN22-Conv filtered, and syn-EN_hi-filtered, respectively. It is
& IN22-Gen (Gala et al., 2023) and use standard evident that filtering the synthetic documents using
evaluation metrics suitable for each task like ac- TinyLMs significantly improves the performance of
curacy, f1-score, Rouge-L (Lin, 2004) and chrF++ both NLU and NLG tasks. In Table 2, we observe
(Popović, 2017). that for tasks like CoLA (Warstadt et al., 2019),
language models trained solely on synthetic data
6 Results lag behind when compared to other tasks. This
suggests that synthetic corpora may lack certain
We now present our results which help establish important elements necessary for language models
the utility of synthetic data for language modeling. to perform competitively in linguistic acceptability
tasks, as opposed to LMs trained on clean, non-
6.1 Main Results
synthetic corpora.
In this section, we present results for Hindi, Gu- Fine-tuning on Web Scraped Data boosts per-
jarati, and English language models trained on formance: Even after filtering, we observe that
clean data, as well as synthetic data generated from language models trained solely on synthetic text
9
We only take first 4k examples of IndicNLG test split for slightly underperform LMs trained on clean text.
each task due the large test split of IndicNLG To address this issue, we conduct extended pretrain-
sst2 cola mrpc qnli qqp rte mnli-m mnli-mm stsb
Model Avg.
acc mcc f1 acc f1 acc acc acc pearson
EN-clean 90.94 40.26 87.4 84.98 84.47 65.34 77.84 77.96 82.67 76.87
syn-EN_hi-unfiltered 84.61 31.1 81.78 79.35 81.44 63.3 72.94 73.16 78.9 71.84
Mono syn-EN_hi-unfiltered + 10% 87.39 34.22 85.77 80.96 81.07 65.11 74.76 74.38 80.32 73.78
syn-EN_hi-filtered 88.3 34.03 86.55 83.59 83.64 63.17 75.6 75.41 81.1 74.60
syn-EN_hi-filtered + 10% 90.13 35.75 86.41 84.75 84.21 65.34 76.99 76.91 81.95 75.83
BI-EN-HI-clean 89.56 38.53 85.56 84.88 84.39 64.25 76.4 77.27 82.07 75.88
BI-EN-HI_syn-parallel-filtered 89.56 39.57 85.71 84.75 84.62 64.98 77.31 77.85 82.41 76.31
Bi BI-EN-HI_syn-nonparallel-filtered 89.79 38.68 86.92 85.08 84.06 65.34 77.15 77.55 83.01 76.40
BI-EN_syn-HI_syn-filtered 87.95 30.05 84.9 83.7 83.97 63.89 75.63 76.24 82.24 74.29
BI-EN_syn-HI_syn-filtered + 10% 89.1 35.45 85.34 84.53 84.18 65.7 76.64 77.24 82.1 75.59

Table 2: Results on English: Dev set of GLUE tasks for different synthetic splits on the base-1k model. Synthetic
LMs perform almost as well as clean LMs after filtering and further training with clean data. Bold values represent
the best amongst synthetic splits.

ing of LMs using clean data sourced from Indic- HI-clean model which is solely trained on clean
MonoDoc. The objective is to determine if this corpora. This implies that it is possible to train
additional training improves performance. We only multilingual models where some languages are
incorporate an additional 10% of clean data com- trained only over a clean subset and others on
pared to the LM’s previous training data. We see synthetic without deteriorating performance across
these results across all three languages, and for languages. We further see that using parallel data
Hindi and Gujarati, we see that by incorporating does not have much impact on multilingual models.
even a small amount of clean data, we observe
an increase in performance on downstream tasks, 6.2 Further Exploration
bringing the LM at par or closer to the performance Impact of source language for synthetic data
of a clean LM. We see an improvement in LMs generation: Choosing the right source language
trained using unfiltered synthetic corpora as well for synthetic corpora is crucial as it influences the
but we believe that filtering leads to the removal of characteristics of the generated translationese text.
noisy documents and thus better performance. We evaluate this using Hindi and Gujarati clean
documents from IndicMonoDoc, translating them
Model iXNLI bbc-a iitp-mr iitp-pr midas Avg.
HI-clean 68.74 80.25 67.74 77.05 78.33 74.42 into English. Since Gujarati has limited data (900M
syn-HI_en-unfiltered 67.32 77.92 65.63 76.81 77.58 73.05
syn-HI_en-filtered 69.48 78.98 65.16 77.43 77.33 73.68
tokens), we train a mini-1k model for fair compar-
syn-HI_en-filtered+10% 70.15 79.56 67.09 78.2 79.03 74.81 ison. In Table 4, we see that the synthetic text
Table 3: Effect of reducing model size for Hindi on
generated from Hindi achieves performance at par
IndicGLUE accuracy. All the results reported here are with the EN-clean model, while the synthetic text
on mini-1k. Bold values represent the best amongst from Gujarati significantly lags behind. This is
synthetic splits likely because Hindi is more macaronic than Gu-
jarati, i.e., a lot of Hindi text from the web consists
Using synthetic for one language doesn’t impact of Hinglish, resulting in better translationese text
performance in another: For many multilingual due to increased overlap between languages. This
language models, data imbalance causes a gap in can also be due to the weaker translations generated
performance across languages. But what if we can by the MT model. The performance gap is notable
combine synthetic data along with clean data for in tasks like STS benchmark, NLI (qnli and mnli),
training multilingual models, would the synthetic and CoLA, suggesting poorer translation quality
part deteriorate the performance of the multilingual from Gu→ En compared to Hi→ En.
model? To experiment with this, we train bilin- Impact of model size: Following Table 4 and 3, we
gual base-1k models over different combinations of see that even after scaling down we see consistent
clean and synthetic corpora for English and Hindi improvements for filtering and adding additional
and evaluate their performance on GLUE (Wang data, which empirically shows that indeed using
et al., 2018), and report performance on IndicNLG, synthetic text after filtering is a viable option for
and Machine translation in Appendix A. Follow- pretraining LMs of varying sizes. In Table 3 we
ing Table 2, we see that using Hindi synthetic data see that after filtering and extended pretraining,
does not affect its performance compared to BI-EN- synthetic text outperforms LMs trained on clean
sst2 cola mrpc qnli qqp rte mnli-m mnli-mm stsb
Model Avg.
acc mcc f1 acc f1 acc acc acc pearson
Original EN-clean 87.95 25.59 83.84 78.83 80.78 64.62 71.6 71.69 73.48 70.93
syn-EN_hi-unfiltered 87.53 19.77 79.02 76.49 77.96 55.4 69.65 70.14 67.37 67.04
Translationese
syn-EN_hi-filtered 87.61 22.81 81.95 77.63 80.57 56.31 70.19 70.89 69.29 68.58
Hi->En
syn-EN_hi-filtered + 10% 87.84 26.61 83.27 78.5 80.36 61.37 71.29 71.11 71.91 70.25
syn-EN_gu-unfiltered 83.11 17.66 78.53 66.01 77.68 53.6 63.21 64.55 27.33 59.08
Translationese
syn-EN_gu-filtered 85.66 21.15 81.45 66.35 77.36 54.15 66.27 65.72 26.16 60.47
Gu->En
syn-EN_gu-filtered + 10% 86.58 25.17 81.67 67.1 77.75 57.76 68.78 68.56 27.54 62.32

Table 4: Effect of source selection for generating synthetic data on the dev set of GLUE benchmark. All the results
reported here are on mini-1k. Bold values represent the best amongst synthetic splits

documents from the web in Hindi. marks compared to parallel synthetic documents.
This might be because there is no explicit align-
XLSum XLSum
Model Cnn Dialogsum Avg.
HG QG ment happening during training between parallel
EN-clean 23.87 24.05 16.08 20.39 21.10
syn-EN_hi-unfiltered 22.17 22.97 12.56 18.30 19.00
documents. See Table 6 for chrF++ scores on
syn-EN_hi-filtered 23.27 23.83 15.88 19.83 20.70 FLoRes-200 (Team et al., 2022), and Appendix
A for chrF++ and BLEU scores on IN22-Conv,
Table 5: Performance of English models on NLG tasks.
All the results reported here are on base-1k and use
IN22-Gen (Gala et al., 2023).
Rouge-L F1 scores.
7 Conclusion
Impact on NLG: Without extended pretraining, In this paper, we performed a first of its kind study
language models trained on synthetic text perform of the feasibility of using translationese data for
as well as those trained on clean documents, sug- training language models. We proposed a simple
gesting that for NLG tasks, synthetic data suffices pipeline involving the translation of documents at
for pretraining, eliminating the need for clean data. scale followed by filtering using small and effi-
This trend is evident across Hindi, Gujarati, and cient language models trained on clean data. We
English NLG results (Tables 1 and 5). As their per- then showed on a variety of downstream natural
formance matches models trained on clean data, we language understanding and generative tasks that
refrain from extended pretraining for NLG tasks, language models trained on unclean synthetic data
focusing primarily on abstractive summarization were only slightly inferior to those trained on origi-
for evaluating generation capabilities. nal data, however, filtered synthetic data with ex-
tended pretraining on clean data mostly eliminates
Model
FLORES this gap. We also observed a positive impact of syn-
EN-HI HI-EN Avg.
BI-EN-HI-clean 46.56 51.7 49.13 thetic data on TinyLMs fine-tuned on 10% clean
BI-EN-HI_syn-parallel-filtered 44.12 50.64 47.38 data. While we observed that the source language,
BI-EN-HI_syn-nonparallel-filtered 45.65 51.29 48.47
EN-GU GU-EN Avg.
and potential content, for synthetic data genera-
BI-EN-GU-clean 26.44 35.3 30.87 tion matters, it is clear that synthetic data can help
BI-EN-GU_syn-parallel-filtered 26.77 34.84 30.81 bridge the resource scarcity faced by a vast major-
BI-EN-GU_syn-nonparallel-filtered 26.7 36.54 31.62
ity of languages for language modeling. As a part
Table 6: chrF++ Scores on FLoRes translation task. EN- of this work, we also created IndicMonoDoc, the
HI models are based on base-1k and EN-GU models are largest collection of clean document-level datasets
based on mini-1k for 22 Indic languages and English, which we re-
lease along with our synthetic data, pipelines, and
Impact on Machine Translation: (MT) We fo- code. In the future, we aim to first generate syn-
cus on MT separately as a special case of NLG. thetic data at much larger scales and experiment
We hypothesized that using parallel synthetic docu- with large language models to push the boundaries
ments for bilingual models would improve transla- of language modeling for low-resource languages.
tion performance by enhancing alignment between
languages. However, our evaluation fails this hy- Limitations
pothesis. Results indicate that using nonparallel We consider the following limitations of our work.
synthetic documents yields similar translation per-
formance across language directions and bench- • Work mainly focuses on TinyLMs so not all
observations may carry over to large language a CC-0 License11 .
models, however, synthetic data generated
from translations can surely help fill knowl-
edge gaps. References
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, and
• We could not experiment with entire test sets Benoît Sagot. 2022. Towards a Cleaner Document-
of IndicNLG tasks like Question Generation, Oriented Multilingual Crawled Corpus. arXiv e-
WikiBio generation, Headline Generation, and prints, page arXiv:2201.06642.
Sentence Summarization due to its vast test
Divyanshu Aggarwal, Vivek Gupta, and Anoop
split but we do not expect the main trends to Kunchukuttan. 2022. IndicXNLI: Evaluating multi-
change given that we already use 4000 exam- lingual inference for Indian languages. In Proceed-
ples per language. ings of the 2022 Conference on Empirical Methods in
Natural Language Processing, pages 10994–11006,
• For GLUE tasks we report our numbers on Abu Dhabi, United Arab Emirates. Association for
the validation set and not on the test set for all Computational Linguistics.
models since the scale of our experiments was
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Al-
large, automatically submitting results on the shamsi, Alessandro Cappelli, Ruxandra Cojocaru,
test set was not feasible. We follow existing Mérouane Debbah, Étienne Goffinet, Daniel Hesslow,
works doing the same. Plus our goal is not Julien Launay, Quentin Malartic, Daniele Mazzotta,
to achieve state-of-the-art results but rather Badreddine Noune, Baptiste Pannier, and Guilherme
Penedo. 2023. The falcon series of open language
to establish the utility of synthetic data by models.
observing trends.
Nikolay Bogoychev and Rico Sennrich. 2019. Domain,
• We have not manually verified synthetic data translationese and noise in synthetic data for neural
so despite cleaning using TinyLMs there are machine translation. CoRR, abs/1911.03362.
chances that there may be some toxic content
Ondrej Bojar, Vojtech Diatka, Pavel Rychlỳ, Pavel
or bad documents. However, this is a future Stranák, Vít Suchomel, Ales Tamchyna, and Daniel
work. Zeman. 2014. Hindencorp-hindi-english and hindi-
only corpus for machine translation. In LREC, pages
• Our framework places significant emphasis 3550–3555.
on the translation model’s performance. Nev-
ertheless, we are confident that this approach Tom Brown, Benjamin Mann, Nick Ryder, Melanie
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
will significantly contribute to enhancing the Neelakantan, Pranav Shyam, Girish Sastry, Amanda
performance of mid-resource languages, par- Askell, Sandhini Agarwal, Ariel Herbert-Voss,
ticularly those for which the translation model Gretchen Krueger, Tom Henighan, Rewon Child,
demonstrates considerable proficiency. Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Ma-
Ethical Considerations teusz Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish, Alec
As a part of this paper, we release monolingual and Radford, Ilya Sutskever, and Dario Amodei. 2020.
synthetic data. While we have taken care to remove Language models are few-shot learners. In Ad-
vances in Neural Information Processing Systems,
any toxic content, accidental occurrences may exist volume 33, pages 1877–1901. Curran Associates,
and thus we exercise caution when using our data Inc.
for training language models as they may produce
toxic outputs. Given that we have shown the utility Yulong Chen, Yang Liu, Liang Chen, and Yue Zhang.
2021. DialogSum: A real-life scenario dialogue sum-
of synthetic data for training LMs, it should be marization dataset. In Findings of the Association
possible to mass produce synthetic toxic data in for Computational Linguistics: ACL-IJCNLP 2021,
various languages leading to LMs that can generate pages 5062–5074, Online. Association for Computa-
multilingual toxic content. However, this opens up tional Linguistics.
research opportunities on how to detect and filter Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan.
toxic content from synthetically created data. 2020. A survey of multilingual neural machine trans-
We aim to release the code and models with an lation. ACM Comput. Surv., 53(5).
MIT License10 . The dataset will be released under 11
[Link]
10
[Link] public-domain/cc0/
Daniel Deutsch and Dan Roth. 2020. SacreROUGE: An for Computational Linguistics: ACL-IJCNLP 2021,
open-source library for using and developing sum- pages 4693–4703, Online. Association for Computa-
marization evaluation metrics. In Proceedings of tional Linguistics.
Second Workshop for NLP Open Source Software
(NLP-OSS), pages 120–125, Online. Association for Kenneth Heafield. 2011. KenLM: Faster and smaller
Computational Linguistics. language model queries. In Proceedings of the Sixth
Workshop on Statistical Machine Translation, pages
Sumanth Doddapaneni, Rahul Aralikatte, Gowtham 187–197, Edinburgh, Scotland. Association for Com-
Ramesh, Shreya Goyal, Mitesh M Khapra, Anoop putational Linguistics.
Kunchukuttan, and Pratyush Kumar. 2023. Towards
leaving no indic language behind: Building mono- Dan Hendrycks, Collin Burns, Steven Basart, Andy
lingual corpora, benchmark and models for indic Zou, Mantas Mazeika, Dawn Song, and Jacob Stein-
languages. In Proceedings of the 61st Annual Meet- hardt. 2021. Measuring massive multitask language
ing of the Association for Computational Linguistics understanding. In 9th International Conference on
(Volume 1: Long Papers), pages 12402–12426. Learning Representations, ICLR 2021, Virtual Event,
Austria, May 3-7, 2021. [Link].
Sergey Edunov, Myle Ott, Michael Auli, and David
Grangier. 2018. Understanding back-translation at Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch,
scale. In Proceedings of the 2018 Conference on Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Empirical Methods in Natural Language Processing, Diego de Las Casas, Lisa Anne Hendricks, Johannes
pages 489–500, Brussels, Belgium. Association for Welbl, Aidan Clark, Tom Hennigan, Eric Noland,
Computational Linguistics. Katie Millican, George van den Driessche, Bogdan
Damoc, Aurelia Guy, Simon Osindero, Karen Si-
Ronen Eldan and Yuanzhi Li. 2023. Tinystories: How monyan, Erich Elsen, Jack W. Rae, Oriol Vinyals,
small can language models be and still speak coherent and Laurent Sifre. 2022. Training compute-optimal
english? large language models.
Jay Gala, Pranjal A Chitale, A K Raghavan, Varun Divyanshu Kakwani, Anoop Kunchukuttan, Satish
Gumma, Sumanth Doddapaneni, Aswanth Kumar M, Golla, NC Gokul, Avik Bhattacharyya, Mitesh M
Janki Atul Nawale, Anupama Sujatha, Ratish Pudup- Khapra, and Pratyush Kumar. 2020. Indicnlpsuite:
pully, Vivek Raghavan, Pratyush Kumar, Mitesh M Monolingual corpora, evaluation benchmarks and
Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. pre-trained multilingual language models for indian
Indictrans2: Towards high-quality and accessible ma- languages. In Findings of the Association for Com-
chine translation models for all 22 scheduled indian putational Linguistics: EMNLP 2020, pages 4948–
languages. Transactions on Machine Learning Re- 4961.
search.

Martin Gellerstam. 1986. Translationese in swedish Yoon Kim and Alexander M. Rush. 2016. Sequence-
novels translated from english. Translation studies level knowledge distillation. In Proceedings of the
in Scandinavia, 1:88–95. 2016 Conference on Empirical Methods in Natu-
ral Language Processing, pages 1317–1327, Austin,
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Texas. Association for Computational Linguistics.
2012. Building large monolingual dictionaries at the
Leipzig corpora collection: From 100 to 200 lan- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
guages. In Proceedings of the Eighth International method for stochastic optimization. In 3rd Inter-
Conference on Language Resources and Evaluation national Conference on Learning Representations,
(LREC’12), pages 759–765, Istanbul, Turkey. Euro- ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
pean Language Resources Association (ELRA). Conference Track Proceedings.

Gili Goldin, Ella Rabinovich, and Shuly Wintner. 2018. Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier
Native language identification with user generated Garcia, Christopher A Choquette-Choo, Katherine
content. In Proceedings of the 2018 Conference on Lee, Derrick Xin, Aditya Kusupati, Romi Stella,
Empirical Methods in Natural Language Processing, Ankur Bapna, et al. 2023. Madlad-400: A multilin-
pages 3591–3601, Brussels, Belgium. Association gual and document-level large audited dataset. arXiv
for Computational Linguistics. preprint arXiv:2309.04662.

Yvette Graham, Barry Haddow, and Philipp Koehn. Aman Kumar, Himani Shrotriya, Prachi Sahu, Amogh
2019. Translationese in machine translation eval- Mishra, Raj Dabre, Ratish Puduppully, Anoop
uation. CoRR, abs/1906.09833. Kunchukuttan, Mitesh M. Khapra, and Pratyush Ku-
mar. 2022. IndicNLG benchmark: Multilingual
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Is- datasets for diverse NLG tasks in Indic languages.
lam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, In Proceedings of the 2022 Conference on Empiri-
M. Sohel Rahman, and Rifat Shahriyar. 2021. XL- cal Methods in Natural Language Processing, pages
sum: Large-scale multilingual abstractive summariza- 5363–5394, Abu Dhabi, United Arab Emirates. As-
tion for 44 languages. In Findings of the Association sociation for Computational Linguistics.
Hugo Laurençon, Lucile Saulnier, Thomas Wang, Jihyung Moon, Hyunchang Cho, and Eunjeong L. Park.
Christopher Akiki, Albert Villanova del Moral, Teven 2020. Revisiting round-trip translation for quality
Le Scao, Leandro Von Werra, Chenghao Mou, Ed- estimation. In Proceedings of the 22nd Annual Con-
uardo González Ponferrada, Huu Nguyen, et al. 2022. ference of the European Association for Machine
The bigscience roots corpus: A 1.6 tb composite mul- Translation, pages 91–104, Lisboa, Portugal. Euro-
tilingual dataset. Advances in Neural Information pean Association for Machine Translation.
Processing Systems, 35:31809–31826.
Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing
Chin-Yew Lin. 2004. ROUGE: A package for auto- Xiang, et al. 2016. Abstractive text summarization
matic evaluation of summaries. In Text Summariza- using sequence-to-sequence rnns and beyond. arXiv
tion Branches Out, pages 74–81, Barcelona, Spain. preprint arXiv:1602.06023.
Association for Computational Linguistics.
Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Jingwei Ni, Zhijing Jin, Markus Freitag, Mrinmaya
Wang, Shuohui Chen, Daniel Simig, Myle Ott, Na- Sachan, and Bernhard Schölkopf. 2022. Original or
man Goyal, Shruti Bhosale, Jingfei Du, Ramakanth translated? a causal analysis of the impact of trans-
Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav lationese on machine translation performance. In
Chaudhary, Brian O’Horo, Jeff Wang, Luke Zettle- Proceedings of the 2022 Conference of the North
moyer, Zornitsa Kozareva, Mona Diab, Veselin Stoy- American Chapter of the Association for Computa-
anov, and Xian Li. 2022. Few-shot learning with tional Linguistics: Human Language Technologies,
multilingual generative language models. In Proceed- pages 5303–5320, Seattle, United States. Association
ings of the 2022 Conference on Empirical Methods for Computational Linguistics.
in Natural Language Processing, pages 9019–9052,
Abu Dhabi, United Arab Emirates. Association for Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent
Computational Linguistics. Romary. 2019. Asynchronous pipelines for process-
ing huge corpora on medium to low resource infras-
Ilya Loshchilov and Frank Hutter. 2019. Decoupled tructures. Proceedings of the Workshop on Chal-
weight decay regularization. In 7th International lenges in the Management of Large Corpora (CMLC-
Conference on Learning Representations, ICLR 2019, 7) 2019. Cardiff, 22nd July 2019, pages 9 – 16,
New Orleans, LA, USA, May 6-9, 2019. OpenRe- Mannheim. Leibniz-Institut für Deutsche Sprache.
[Link].
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Yash Madhani, Mitesh M. Khapra, and Anoop
Jing Zhu. 2002. Bleu: a method for automatic evalu-
Kunchukuttan. 2023a. Bhasha-abhijnaanam: Native-
ation of machine translation. In Proceedings of the
script and romanized language identification for 22
40th Annual Meeting of the Association for Compu-
indic languages.
tational Linguistics, pages 311–318, Philadelphia,
Yash Madhani, Sushane Parthan, Priyanka Bedekar, Pennsylvania, USA. Association for Computational
Gokul Nc, Ruchi Khapra, Anoop Kunchukuttan, Linguistics.
Pratyush Kumar, and Mitesh Khapra. 2023b. Aksha-
rantar: Open Indic-language transliteration datasets Maja Popović. 2017. chrF++: words helping charac-
and models for the next billion users. In Findings ter n-grams. In Proceedings of the Second Confer-
of the Association for Computational Linguistics: ence on Machine Translation, pages 612–618, Copen-
EMNLP 2023, pages 40–57, Singapore. Association hagen, Denmark. Association for Computational Lin-
for Computational Linguistics. guistics.

Benjamin Marie, Raphael Rubino, and Atsushi Fujita. Maja Popović, Alberto Poncelas, Marija Brkic, and
2020. Tagged back-translation revisited: Why does Andy Way. 2020. Neural machine translation for
it really work? In Proceedings of the 58th Annual translating into Croatian and Serbian. In Proceedings
Meeting of the Association for Computational Lin- of the 7th Workshop on NLP for Similar Languages,
guistics, pages 5990–5997, Online. Association for Varieties and Dialects, pages 102–113, Barcelona,
Computational Linguistics. Spain (Online). International Committee on Compu-
tational Linguistics (ICCL).
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and
Ryan McDonald. 2020. On faithfulness and factu-
ality in abstractive summarization. In Proceedings Yiwei Qin, Weizhe Yuan, Graham Neubig, and Pengfei
of the 58th Annual Meeting of the Association for Liu. 2023. T5Score: Discriminative fine-tuning of
Computational Linguistics, pages 1906–1919, On- generative evaluation metrics. In Findings of the
line. Association for Computational Linguistics. Association for Computational Linguistics: EMNLP
2023, pages 15185–15202, Singapore. Association
Anthony McEnery, Paul Baker, Robert Gaizauskas, and for Computational Linguistics.
Hamish Cunningham. 2000. Emille: Building a cor-
pus of south asian languages. In Proceedings of the Ella Rabinovich and Shuly Wintner. 2015. Unsuper-
International Conference on Machine Translation vised identification of translationese. Transactions of
and Multilingual Applications in the new Millennium: the Association for Computational Linguistics, 3:419–
MT 2000. 432.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Rico Sennrich, Barry Haddow, and Alexandra Birch.
Millican, Jordan Hoffmann, Francis Song, John 2016b. Neural machine translation of rare words
Aslanides, Sarah Henderson, Roman Ring, Susan- with subword units. In Proceedings of the 54th An-
nah Young, Eliza Rutherford, Tom Hennigan, Ja- nual Meeting of the Association for Computational
cob Menick, Albin Cassirer, Richard Powell, George Linguistics, ACL 2016, August 7-12, 2016, Berlin,
van den Driessche, Lisa Anne Hendricks, Mari- Germany, Volume 1: Long Papers. The Association
beth Rauh, Po-Sen Huang, Amelia Glaese, Jo- for Computer Linguistics.
hannes Welbl, Sumanth Dathathri, Saffron Huang,
Jonathan Uesato, John Mellor, Irina Higgins, Anto- Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao,
nia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Abu Awal Md Shoeb, Abubakar Abid, Adam
Siddhant Jayakumar, Elena Buchatskaya, David Bud- Fisch, Adam R. Brown, Adam Santoro, Aditya
den, Esme Sutherland, Karen Simonyan, Michela Pa- Gupta, Adrià Garriga-Alonso, Agnieszka Kluska,
ganini, Laurent Sifre, Lena Martens, Xiang Lorraine Aitor Lewkowycz, Akshat Agarwal, Alethea Power,
Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Alex Ray, Alex Warstadt, Alexander W. Kocurek,
Gribovskaya, Domenic Donato, Angeliki Lazaridou, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Par-
Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsim- rish, Allen Nie, Aman Hussain, Amanda Askell,
poukelli, Nikolai Grigorev, Doug Fritz, Thibault Sot- Amanda Dsouza, Ameet Rahane, Anantharaman S.
tiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Iyer, Anders Andreassen, Andrea Santilli, Andreas
Daniel Toyama, Cyprien de Masson d’Autume, Yujia Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K.
Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Lampinen, Andy Zou, Angela Jiang, Angelica Chen,
Aidan Clark, Diego de Las Casas, Aurelia Guy, Anh Vuong, Animesh Gupta, Anna Gottardi, Anto-
Chris Jones, James Bradbury, Matthew Johnson, nio Norelli, Anu Venkatesh, Arash Gholamidavoodi,
Blake Hechtman, Laura Weidinger, Iason Gabriel, Arfa Tabassum, Arul Menezes, Arun Kirubarajan,
William Isaac, Ed Lockhart, Simon Osindero, Laura Asher Mullokandov, Ashish Sabharwal, Austin Her-
Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, rick, Avia Efrat, Aykut Erdem, Ayla Karakas, and
Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Ko- et al. 2022. Beyond the imitation game: Quantifying
ray Kavukcuoglu, and Geoffrey Irving. 2022. Scaling and extrapolating the capabilities of language models.
language models: Methods, analysis & insights from CoRR, abs/2206.04615.
training gopher.
Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha,
Bheemaraj, Mayank Jobanputra, Raghavan AK, Bo Wen, and Yunfeng Liu. 2023. Roformer: En-
Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Ma- hanced transformer with rotary position embedding.
halakshmi J, Divyanshu Kakwani, Navneet Kumar,
Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, NLLB Team, Marta R. Costa-jussà, James Cross, Onur
Vivek Raghavan, Anoop Kunchukuttan, Pratyush Ku- Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Hef-
mar, and Mitesh Shantadevi Khapra. 2022. Samanan- fernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
tar: The largest publicly available parallel corpora Jean Maillard, Anna Sun, Skyler Wang, Guillaume
collection for 11 indic languages. Transactions of the Wenzek, Al Youngblood, Bapi Akula, Loic Bar-
Association for Computational Linguistics, 10:145– rault, Gabriel Mejia Gonzalez, Prangthip Hansanti,
162. John Hoffman, Semarley Jarrett, Kaushik Ram
Sadagopan, Dirk Rowe, Shannon Spruit, Chau
Ricardo Rei, Ana C Farinha, Chrysoula Zerva, Daan Tran, Pierre Andrews, Necip Fazil Ayan, Shruti
van Stigt, Craig Stewart, Pedro Ramos, Taisiya Bhosale, Sergey Edunov, Angela Fan, Cynthia
Glushkova, André F. T. Martins, and Alon Lavie. Gao, Vedanuj Goswami, Francisco Guzmán, Philipp
2021. Are references really needed? unbabel-IST Koehn, Alexandre Mourachko, Christophe Ropers,
2021 submission for the metrics shared task. In Pro- Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
ceedings of the Sixth Conference on Machine Trans- 2022. No language left behind: Scaling human-
lation, pages 1030–1040, Online. Association for centered machine translation.
Computational Linguistics.
James Thorne, Andreas Vlachos, Oana Cocarascu,
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon
Christos Christodoulopoulos, and Arpit Mittal. 2018.
Lavie. 2020. COMET: A neural framework for MT
The fact extraction and VERification (FEVER)
evaluation. In Proceedings of the 2020 Conference
shared task. In Proceedings of the First Workshop on
on Empirical Methods in Natural Language Process-
Fact Extraction and VERification (FEVER), pages 1–
ing (EMNLP), pages 2685–2702, Online. Association
9, Brussels, Belgium. Association for Computational
for Computational Linguistics.
Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016a. Improving neural machine translation models Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way.
with monolingual data. In Proceedings of the 54th 2018. Attaining the unattainable? reassessing claims
Annual Meeting of the Association for Computational of human parity in neural machine translation. In Pro-
Linguistics (Volume 1: Long Papers), pages 86–96, ceedings of the Third Conference on Machine Trans-
Berlin, Germany. Association for Computational Lin- lation: Research Papers, pages 113–123, Brussels,
guistics. Belgium. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Roberto Luis López, Rui Ribeiro, Salomey Osei,
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Sampo Pyysalo, Sebastian Nagel, Shamik Bose,
Kaiser, and Illia Polosukhin. 2017. Attention is all Shamsuddeen Hassan Muhammad, Shanya Sharma,
you need. Advances in neural information processing Shayne Longpre, Somaieh Nikpoor, Stanislav Silber-
systems, 30. berg, Suhas Pai, Sydney Zink, Tiago Timponi Tor-
rent, Timo Schick, Tristan Thrush, Valentin Danchev,
Alex Wang, Amanpreet Singh, Julian Michael, Felix Vassilina Nikoulina, Veronika Laippala, Violette
Hill, Omer Levy, and Samuel R Bowman. 2018. Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Ta-
Glue: A multi-task benchmark and analysis platform lat, Arun Raja, Benjamin Heinzerling, Chenglei Si,
for natural language understanding. arXiv preprint Davut Emre Taşar, Elizabeth Salesky, Sabrina J.
arXiv:1804.07461. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea
Santilli, Antoine Chaffin, Arnaud Stiegler, Debajy-
Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan oti Datta, Eliza Szczechla, Gunjan Chhablani, Han
Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mos- Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan
quera, Bhargavi Paranjabe, Adina Williams, Tal Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Sai-
Linzen, and Ryan Cotterell. 2023. Findings of the ful Bari, Maged S. Al-shaibani, Matteo Manica, Ni-
BabyLM challenge: Sample-efficient pretraining on hal Nayak, Ryan Teehan, Samuel Albanie, Sheng
developmentally plausible corpora. In Proceedings Shen, Srulik Ben-David, Stephen H. Bach, Taewoon
of the BabyLM Challenge at the 27th Conference on Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Ur-
Computational Natural Language Learning, pages mish Thakker, Vikas Raunak, Xiangru Tang, Zheng-
1–34, Singapore. Association for Computational Lin- Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri,
guistics. Hadar Tojarieh, Adam Roberts, Hyung Won Chung,
Jaesung Tae, Jason Phang, Ofir Press, Conglong Li,
Alex Warstadt, Amanpreet Singh, and Samuel R. Bow- Deepak Narayanan, Hatim Bourfoune, Jared Casper,
man. 2019. Neural network acceptability judgments. Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia
Transactions of the Association for Computational Zhang, Mohammad Shoeybi, Myriam Peyrounette,
Linguistics, 7:625–641. Nicolas Patry, Nouamane Tazi, Omar Sanseviero,
Patrick von Platen, Pierre Cornette, Pierre François
BigScience Workshop, :, Teven Le Scao, Angela Fan, Lavallée, Rémi Lacroix, Samyam Rajbhandari, San-
Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel chit Gandhi, Shaden Smith, Stéphane Requena, Suraj
Hesslow, Roman Castagné, Alexandra Sasha Luc- Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet
cioni, François Yvon, Matthias Gallé, Jonathan Singh, Anastasia Cheveleva, Anne-Laure Ligozat,
Tow, Alexander M. Rush, Stella Biderman, Albert Arjun Subramonian, Aurélie Névéol, Charles Lover-
Webson, Pawan Sasanka Ammanamanchi, Thomas ing, Dan Garrette, Deepak Tunuguntla, Ehud Reiter,
Wang, Benoît Sagot, Niklas Muennighoff, Albert Vil- Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bog-
lanova del Moral, Olatunji Ruwase, Rachel Bawden, danov, Genta Indra Winata, Hailey Schoelkopf, Jan-
Stas Bekman, Angelina McMillan-Major, Iz Belt- Christoph Kalo, Jekaterina Novikova, Jessica Zosa
agy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pe- Forde, Jordan Clive, Jungo Kasai, Ken Kawamura,
dro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Liam Hazan, Marine Carpuat, Miruna Clinciu, Na-
Yacine Jernite, Julien Launay, Margaret Mitchell, joung Kim, Newton Cheng, Oleg Serikov, Omer
Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Antverg, Oskar van der Wal, Rui Zhang, Ruochen
Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani
Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun,
Chris Emezue, Christopher Klamm, Colin Leong, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov,
Daniel van Strien, David Ifeoluwa Adelani, Dragomir Vladislav Mikhailov, Yada Pruksachatkun, Yonatan
Radev, Eduardo González Ponferrada, Efrat Lev- Belinkov, Zachary Bamberger, Zdeněk Kasner, Al-
kovizh, Ethan Kim, Eyal Bar Natan, Francesco De ice Rueda, Amanda Pestana, Amir Feizpour, Ammar
Toni, Gérard Dupont, Germán Kruszewski, Giada Khan, Amy Faranak, Ana Santos, Anthony Hevia,
Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Antigona Unldreaj, Arash Aghagol, Arezoo Abdol-
Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar lahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh
Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Behroozi, Benjamin Ajibade, Bharat Saxena, Car-
Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, los Muñoz Ferrandis, Daniel McDuff, Danish Con-
Joseph Tobing, Joydeep Bhattacharjee, Khalid Al- tractor, David Lansky, Davis David, Douwe Kiela,
mubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Duong A. Nguyen, Edward Tan, Emi Baylor, Ez-
Leon Weber, Long Phan, Loubna Ben allal, Lu- inwanne Ozoani, Fatima Mirza, Frankline Onon-
dovic Tanguy, Manan Dey, Manuel Romero Muñoz, iwu, Habib Rezanejad, Hessie Jones, Indrani Bhat-
Maraim Masoud, María Grandury, Mario Šaško, tacharya, Irene Solaiman, Irina Sedenko, Isar Ne-
Max Huang, Maximin Coavoux, Mayank Singh, jadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis
Mike Tian-Jian Jiang, Minh Chien Vu, Moham- Sanz, Livia Dutra, Mairon Samagaio, Maraim El-
mad A. Jauhar, Mustafa Ghaleb, Nishant Subramani, badri, Margot Mieskes, Marissa Gerchick, Martha
Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Akinlolu, Michael McKenna, Mike Qiu, Muhammed
Omar Espejel, Ona de Gibert, Paulo Villegas, Pe- Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Ra-
ter Henderson, Pierre Colombo, Priscilla Amuok, jani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel,
Quentin Lhoest, Rheza Harliman, Rishi Bommasani,
Ran An, Rasmus Kromann, Ryan Hao, Samira Al- affected by a small margin and coupled with results
izadeh, Sarmad Shubber, Silas Wang, Sourav Roy, in Table 2 showing that scores are not affected by
Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le,
using Hindi synthetic parallel data.
Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap,
Alfredo Palasciano, Alison Callahan, Anima Shukla,
Antonio Miranda-Escalada, Ayush Singh, Benjamin B Training and Evaluation
Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag
Jain, Chuxin Xu, Clémentine Fourrier, Daniel León B.1 Training
Periñán, Daniel Molano, Dian Yu, Enrique Manjava-
cas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, In this section, we list the dataset and hyperparam-
Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, eters used for training our models for the experi-
Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, ments. For the pretraining of the base models, we
Jonas Golde, Jose David Posada, Karthik Ranga-
sai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa keep a hard limit for the base-1k model as 2.38B
Shinzato, Madeleine Hahn de Bykhovetz, Maiko tokens and for the mini-1k model as 1B tokens. But
Takeuchi, Marc Pàmies, Maria A Castillo, Mari- for the TinyLM we relax this token limit until we
anna Nezhurina, Mario Sänger, Matthias Samwald, see overfitting. For our experiments, we use the
Michael Cullan, Michael Weinberg, Michiel De
Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank,
NVIDIA A100-SXM4-80GB GPUs.
Myungsun Kang, Natasha Seelam, Nathan Dahlberg,
Nicholas Michio Broad, Nikolaus Muellner, Pascale B.2 Extended pretraining
Fung, Patrick Haller, Ramya Chandrasekhar, Renata
Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline For the mini-1k models, we randomly sample 100M
Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, tokens from the clean subset of IndicMonoDoc for
Shlok S Deshmukh, Shubhanshu Mishra, Sid Ki- the target language, and for the base-1k model, we
blawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Ku- sample 200M for extended pretraining. We use
mar, Stefan Schweter, Sushil Bharati, Tanmay Laud,
Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Ya- the same hyperparameters as training and perform
nis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, extended pretraining for 2 epochs over this newly
Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli sampled clean data.
Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and
Thomas Wolf. 2022. Bloom: A 176b-parameter B.3 Fine-tuning
open-access multilingual language model.
For GLUE tasks we use the dev split on the clean
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale,
Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and part and do hyperparameter tuning to achieve the
Colin Raffel. 2021. mT5: A massively multilingual best scores, and then we use the same hyperparam-
pre-trained text-to-text transformer. In Proceedings eters for all synthetic experiments. For IndicGLUE
of the 2021 Conference of the North American Chap- we follow a similar setting for the val split to find
ter of the Association for Computational Linguistics:
Human Language Technologies, pages 483–498, On- good hyperparameters and report results on the test
line. Association for Computational Linguistics. split like Kakwani et al. (2020). For all classifi-
cation and regression tasks, we use a single linear
Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021.
layer and use an appropriate activation function
Bartscore: Evaluating generated text as text genera-
tion. In Advances in Neural Information Processing for classification and regression respectively. We
Systems 34: Annual Conference on Neural Informa- use an Adam optimizer (Kingma and Ba, 2015)
tion Processing Systems 2021, NeurIPS 2021, De- with a learning rate of 1e−5 and a batch size of
cember 6-14, 2021, virtual, pages 27263–27277. 48. For NLG tasks we do extended pretraining
Mike Zhang and Antonio Toral. 2019. The effect of using a separator token in between the input and
translationese in machine translation test sets. CoRR, output sequence with an effective batch size of 768
abs/1906.08069. examples and only calculate loss for the output se-
quence. We use an AdamW optimizer (Loshchilov
A Additional results
and Hutter, 2019) with learning rate = 6e−4 , weight
We report additional results in this section. Tables decay = 1e−1 , β1 = 0.9, β2 = 0.95 and ϵ = 1e−5 .
7, 8 show the chrF++ and BLEU scores across three For translation, we randomly sample 1M parallel
translation evaluation benchmarks. This shows that sentence for each language pair from the samanan-
using parallel synthetic data does not deteriorate the tar corpus (Ramesh et al., 2022) and evaluate on
performance of the language model. Similar results FloRes (Team et al., 2022), IN22-Conv and IN22-
are shown in Table 9 for IndicNLG tasks where Gen (Gala et al., 2023). We list the batch size and
performance on Hindi generation tasks are only number of epochs of each task in Table 12.
IN22-Conv IN22-Gen FLORES
Model
EN-HI HI-EN EN-HI HI-EN EN-HI HI-EN
BI-EN-HI-clean 41.22 50.3 43.49 47.83 46.56 51.7
BI-EN-HI_syn-parallel-filtered 41.92 49.67 41.61 46.95 44.12 50.64
BI-EN-HI_syn-nonparallel-filtered 40.74 49.54 42.28 47.66 45.65 51.29
EN-GU GU-EN EN-GU GU-EN EN-GU GU-EN
BI-EN-GU-clean 35.85 41.27 22.95 31.83 26.44 35.3
BI-EN-GU_syn-parallel-filtered 34.36 41.86 22.93 30.84 26.77 34.84
BI-EN-GU_syn-nonparallel-filtered 34.49 42.08 23.06 32.81 26.7 36.54

Table 7: chrF++ Scores on FloRes, IN22-Conv and IN22-Gen splits for translation task. EN-HI models are based on
base-1k and EN-GU models are based on mini-1k. Bold values represent the best amongst synthetic splits.

IN22-Conv IN22-Gen FLORES

Model
EN-HI HI-EN EN-HI HI-EN EN-HI HI-EN
BI-EN-HI-clean 19.58 23.01 17.23 19.72 21.8 21.73
BI-EN-HI_syn-parallel-filtered 19.64 23.79 16.57 20.14 21.63 22.6
BI-EN-HI_syn-nonparallel-filtered 19.25 22.47 16.37 19.74 21.51 21.74
EN-GU GU-EN EN-GU GU-EN EN-GU GU-EN
BI-EN-GU-clean 10.24 15.19 4.65 7.92 5.44 9.57
BI-EN-GU_syn-parallel-filtered 11.24 15.7 4.87 8.44 6.7 10.02
BI-EN-GU_syn-nonparallel-filtered 10.86 15.57 5.07 9.07 6.17 10.03

Table 8: BLEU Scores on FloRes, IN22-Conv and IN22-Gen splits for translation task. EN-HI models are based on
base-1k and EN-GU models are based on mini-1k. Bold values represent the best amongst synthetic splits.

Headline Sentence Question Wikibio

Model
Generation Summarization Generation Generation
BI-EN-HI-clean 27.47 23.78 24.25 50.82
BI-EN-HI_syn-parallel-filtered 26.96 23.10 25.38 48.26
BI-EN-HI_syn-nonparallel-filtered 27.32 22.84 24.95 50.22

Table 9: Performance of Bilingual models on IndicNLG tasks. All the results reported here are on base-1k and use
Rouge-L F1 scores. Bold values represent the best amongst synthetic splits.
Task Batch size Epochs Metric
IndicXNLI 48 5 Accuracy
BBC-Articles 24 20 Accuracy
IITP-MR 24 20 Accuracy
Hyperparameter Value
vocab_size 56000 IITP-PR 48 20 Accuracy
val_every 0.05 MIDAS 48 20 Accuracy
bs 48 Headline Generation 768 2 Rouge-L F1
n_embed 768 Sentence Summarization 768 2 Rouge-L F1
num_blocks 4 Question Generation 768 2 Rouge-L F1
num_heads 16
WikiBio Generation 768 4 Rouge-L F1
head_size n_embed // num_heads
context_len 1024 iNLTK 48 20 Accuracy
block_size context_len sst2 48 10 Accuracy
attn_drop_value 0.1 CoLA 48 30 MCC
dropout 0.1 mrpc 48 30 F1
ffn_drop_value 0.1 qnli 48 10 Accuracy
use_flashattn TRUE
qqp 48 5 F1
ffn_scaling 4
positional_embedding rope’ rte 48 30 Accuracy
rotatory_embedding_dim head_size // 2 mnli-matched 48 5 Accuracy
lr 6.00E-04 mnli-mismatched 48 5 Accuracy
wd 1.00E-01 stsb 48 20 Pearson
beta_1 0.9 XLSum Headline Gen. 768 4 Rouge-L F1
beta_2 0.95
XLSum Question Gen. 768 4 Rouge-L F1
eps 1.00E-05
epochs 2 CNN Dailymail 768 4 Rouge-L F1
precision bf16 DialogSum 768 4 Rouge-L F1
accumulate_grad_batches 8 Samanantar 768 2 chrF++ / BLEU
gradient_clip_val 1
strategy ddp’
accelerator gpu’ Table 12: Hyperparameters used for finetuning tasks
warmup_steps 5000
num_workers 16
SHUFFLE_SEED 42
PIN_MEMORY TRUE
NUM__NODES 1 B.4 Evaluation
NUM_DEVICES 2
We use torch metrics12 to calculate accuracy, f1-
Table 10: Hyperparameters used for training the mini-1k score, Pearson correlation, Matthew’s correlation
model coefficient. We report chrF++ scores13 and BLEU
scores14 (Papineni et al., 2002) using the sacre-
BLEU15 implementation and Rouge-L f1 scores
using the sacreRouge (Deutsch and Roth, 2020)
implementation by the xl-sum repository16 .
Hyperparameter Value
We report English scores for NLU on the valida-
vocab_size 56000 tion split of the GLUE benchmark and test splits
val_every 0.05
bs 48 for XL-Sum, CNN Dailymail, and Dialogsum NLG
n_embed 768
num_blocks 12 benchmarks. For Hindi and Gujarati, we use the
num_heads 12
head_size n_embed // num_heads test split of IndicGLUE and IndicXNLI.
context_len 1024
block_size context_len For classification and regression tasks, we use
attn_drop_value 0.1
dropout 0.1
the models finetuned according to hyperparame-
ffn_drop_value
use_flashattn
0.1
TRUE
ters mentioned in Appendix B.3 to keep fair com-
ffn_scaling 4 parison for all models and mention results on the
positional_embedding rope’
rotatory_embedding_dim head_size // 2 final epoch. For generations on IndicNLG and En-
lr 6.00E-04
wd 1.00E-01 glish NLG tasks, we use beam search with a beam
beta_1 0.9
beta_2 0.95 width of 5, length penalty of 1.0, n_gram repetition
eps 1.00E-05
epochs 2
penalty of 4 n_grams with sampling set to false and
precision
accumulate_grad_batches
bf16
8
early stopping set to true. We also set a maximum
gradient_clip_val 1 generation length to 64 tokens. For the translation
strategy ddp’
accelerator gpu’ task, we follow a beam search with a beam width of
warmup_steps 5000
num_workers 16 5, maximum new tokens to 256 and early stopping
SHUFFLE_SEED 42
PIN_MEMORY TRUE
12
NUM__NODES 1 [Link]
NUM_DEVICES 2
stable/pages/[Link]
13
chrF++ signature
Table 11: Hyperparameters used for training the base-1k nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.4.0
model 14
sacreBLEU signature:
nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.0
15
[Link]
16
[Link]
to true. such cases. We also do the reverse using the 1B
indic-en version19 of IndicTrans2 to translate 5B
C Perplexity filtering Hindi tokens and 900M Gujarati tokens from Indic-
MonoDoc to English. Together these make up the
C.1 Creating synthetic data
unfiltered synthetic translationese data in English,
“Translationese” is a term used to describe peculiar- Hindi, and Gujarati. We use this corpus for the syn-
ities in the text translated into a specific language, thetic and clean+synthetic part of our experiments.
differentiating it from content originally written in
that language (Gellerstam, 1986). Translated texts C.2 Perplexity filtering
into the target language (via humans or machine-
Following Figure 2, we use these TinyLMs to fil-
generated) often show distinctive features that dif-
ter the generated synthetic translationese corpora
ferentiate them from their original counterparts in
from IndicTrans2. We do this by using perplex-
the target language. These disparities arise from
ity as a measure of document quality score. For
either the influence of the translation process itself
language models, perplexity quantifies how well
on the final product or the inherent “fingerprints”
a model predicts a sequence of tokens. A lower
of the source language subtly present in the tar-
perplexity indicates better predictive performance.
get language rendition (Rabinovich and Wintner,
It is calculated by:
2015). This is a common phenomenon in transla-
tion models where the target language translations N
( )
often show characteristics of the source language 1 X
PPL(W ) = exp − log pθ (wi | w<i )
and add bias to the evaluation of downstream tasks N
i
(Toral et al., 2018), (Zhang and Toral, 2019), (Gra-
ham et al., 2019). So far a lot of work on syn- where the negative log-likelihood measures the er-
thetic translated data has been done for using back ror of the model’s predictions. While calculat-
translations (Sennrich et al., 2016a), (Edunov et al., ing perplexity over a sequence of tokens W ∈
2018) for improving Machine translation perfor- w1 , w2 , . . . , wN we skip the first s tokens where
mance (Marie et al., 2020),(Bogoychev and Sen- s = 10, e = 1024 and calculate loss until only the
nrich, 2019),(Ni et al., 2022) or for classification first e tokens of the document. We find setting e
tasks like native language identification (Goldin to larger values can lead to higher variance in the
et al., 2018), etc. Tranlationese data has been used document scores due to the size of the TinyLM.
for many tasks but we explore the efficacy of us- After initial analysis, we choose s and e such that
ing translationese data for pretraining of language we remove the high uncertainty of the language at
models. We collect monolingual corpora in the the start of an unseen document and avoid penal-
source language as mentioned in section 4 and izing longer documents due to the fragility of the
utilize a powerful off-the-shelf translation model extrapolation ability of TinyLM20 . Note that it is
IndicTrans2 (Gala et al., 2023) to generate trans- important to choose e such that the language model
lationese data. We use the 1B en-indic version17 gives a uniform estimate of perplexity over an al-
of IndicTrans2 using beam search with a beam ready seen sequence of tokens ∈ ws , ws+1 , . . . , we .
value of 5 to translate 5 billion English tokens from For our experiments, we use the TinyLMs to score
IndicMonoDoc to Hindi and Gujarati. Since In- all synthetically generated translationese data and
dicTrans2 can only handle a max sentence length calculate a document score using the above method.
of 256 BPE tokens, we split the documents using Following Laurençon et al. (2022) we do subsam-
Moses Sentence Splitter18 to perform translations pling by thresholding document perplexity scores
into the target language at the sentence level and except Laurençon et al. (2022) did it using Ken-LM
then merge again to form documents. We also re- (Heafield, 2011) and we do it using our TinyLM.
pair translations that exceed in length 256 BPE We keep the threshold value such that we include
tokens using the TinyLM trained on clean corpora enough documents to reach the computed optimal
as mentioned in Section 5 to complete the sen- token count for pretraining experiments.
tence translation, we encounter only 0.002% of
19
[Link]
17
[Link] indictrans2-indic-en-1B
indictrans2-en-indic-1B 20
During experiments we saw that these TinyLMs can only
18
[Link] go up to a certain context length before deteriorating in quality.
Language IndicCorpv2 Ours out webpages that consist of a considerable amount
bn 926.00 5258.47 of English content using a simple script recognition
en 6501.00 11986.53 regex. We perform this scrapping majorly for the
gu 901.00 887.18 bottom 14 low-resource languages. We also add
hi 6107.00 11268.33 script-level recognition using Unicode characters22
kn 875.00 567.16 for each language before crawling a webpage to
ml 931.00 845.32
avoid scrapping non-Indic text.
mr 795.00 1066.76
ne 852.00 1542.39 D.2 Post processing
pa 732.00 449.61
A lot of crawled content consists of unwanted text
ta 476.00 2171.92
te 731.00 767.18 like HTML tags, emoticons, and text in another lan-
ur 667.00 2391.79 guage. We use manual filtering pipelines inspired
as 67.00 57.64 by OSCAR (Ortiz Suárez et al., 2019), (Abadji
brx 2.50 2.25 et al., 2022) to remove such content. We addition-
doi 0.10 0.37 ally use a language detection-based (LID) filtering
gom 31.90 2.91 using cld323 and IndicLID-FTN model (Madhani
kas 0.06 1.27 et al., 2023a) to discard languages not of inter-
mai 13.70 1.51 est. Following Doddapaneni et al. (2023) we per-
mni 0.60 0.99 form document filtering to remove offensive text
or 122.00 81.96 from the corpora using a list of offensive words and
sa 125.00 80.09 phrases extended from work by Team et al. (2022)
sat 4.00 3.05 which consists of offensive words in 209 languages.
sd 13.20 83.81 We also use a Romanized version of this list using
Table 13: Languagewise corpora size comparison in
the transliteration tool by Madhani et al. (2023b) to
Million tokens perform toxic document filtering in 17 languages.
Following Kakwani et al. (2020) & Doddapaneni
et al. (2023) we merge all the filtered corpus with
D IndicMonoDoc Wikipedia, OSCAR (Ortiz Suárez et al., 2019) and
some dumps of mC4 (Xue et al., 2021). Finally,
In this section, we describe the process of creat-
we perform deduplication at paragraph level using
ing the IndicMonoDoc corpus which is the largest
Murmurhash algorithm24 with a 128-bit unsigned
document-level corpora for Indic languages consist-
hash for each monolingual split of the corpora. Af-
ing of 39.5 billion tokens spanning 23 languages.
ter all post-processing steps, the language wise size
IndicMonoDoc comprises 27.5B Indic tokens and
of the corpora is mentioned in Table 13. A major
12B tokens of English tokens. Table 13 shows
chunk of the corpus is comprised of English, Hindi,
language language-wise deduplicated size of the
and Bengali which make up 72.15% of the corpora.
IndicMonoDoc corpus and Figure 3 shows a com-
parative 100% stacked bar plot with IndicCorpv2
which is a sentence level corpora.

D.1 Crawling
To extract URLs from the web we sample word
level n-grams; n=2,...,6 from a sample monolingual
corpora to create a list of keyword searches. We
then randomly merge k; k=1,..,4 keywords to form
a query. Using these queries we perform automatic
web searches to collect a large repository of URLs.
We merge this list with a manual list of sources to
perform URL-level deduplication. We crawl these
webpages leaving out some of them21 . We leave
22
[Link]
21 23
We leave webpages consisting of a [Link] file and [Link]
24
URLs containing offensive text or social media links [Link]

Indic LLM: Data and Tokenizer Insights
No ratings yet
Indic LLM: Data and Tokenizer Insights
9 pages
HARE - HumAn Priors - Key To Small Language Model Efficiency
No ratings yet
HARE - HumAn Priors - Key To Small Language Model Efficiency
10 pages
UnifiedCrawl: Boosting LLMs for Low-Resource Languages
No ratings yet
UnifiedCrawl: Boosting LLMs for Low-Resource Languages
19 pages
2311.09807 The Curious Decline of Linguistic Diversity - AI Garbage in Garbage Out
No ratings yet
2311.09807 The Curious Decline of Linguistic Diversity - AI Garbage in Garbage Out
10 pages
Multilinguality
No ratings yet
Multilinguality
10 pages
ANLP Lec09
No ratings yet
ANLP Lec09
50 pages
Data Generation Using Large Language Models For Text Classification
No ratings yet
Data Generation Using Large Language Models For Text Classification
17 pages
LLMs in Synthetic Text Classification
No ratings yet
LLMs in Synthetic Text Classification
18 pages
Enhancing Multilingual Prompt-Based Code Generation in Llms Via Zero-Shot Cross-Lingual Transfer
No ratings yet
Enhancing Multilingual Prompt-Based Code Generation in Llms Via Zero-Shot Cross-Lingual Transfer
11 pages
Multilingual Translation with LLMs
No ratings yet
Multilingual Translation with LLMs
14 pages
Enhancing Multilingual LLM Pretraining With Model-Based Data Selection
No ratings yet
Enhancing Multilingual LLM Pretraining With Model-Based Data Selection
24 pages
AI: Pre-Trained Language Models Review
No ratings yet
AI: Pre-Trained Language Models Review
15 pages
76 Main Long
No ratings yet
76 Main Long
13 pages
Dynamic Data Sampler For Cross-Language Transfer Learning in Large Language Models
No ratings yet
Dynamic Data Sampler For Cross-Language Transfer Learning in Large Language Models
5 pages
Unsupervised Data Validation Methods For
No ratings yet
Unsupervised Data Validation Methods For
10 pages
Low-Resource NMT Training Guide
No ratings yet
Low-Resource NMT Training Guide
14 pages
Democratizing Llms For Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically-Diverse Prompts
No ratings yet
Democratizing Llms For Low-Resource Languages by Leveraging Their English Dominant Abilities With Linguistically-Diverse Prompts
16 pages
1 s2.0 S2095809922006324 Main
No ratings yet
1 s2.0 S2095809922006324 Main
20 pages
Extending English LLMs Multilingually
No ratings yet
Extending English LLMs Multilingually
65 pages
Incidental Bilingualism in PaLM Translation
No ratings yet
Incidental Bilingualism in PaLM Translation
19 pages
Titullms: A Family of Bangla Llms With Comprehensive Benchmarking
No ratings yet
Titullms: A Family of Bangla Llms With Comprehensive Benchmarking
17 pages
Introduction To LLMS: Transformers Types of Llms Configuration Settings
100% (2)
Introduction To LLMS: Transformers Types of Llms Configuration Settings
7 pages
An Embarrassingly Simple Approach For Transfer Learning From Pretrained Language Models
No ratings yet
An Embarrassingly Simple Approach For Transfer Learning From Pretrained Language Models
7 pages
Multilingual Machine Translation Insights
No ratings yet
Multilingual Machine Translation Insights
16 pages
YAYI 2: Multilingual LLMs for Chinese
No ratings yet
YAYI 2: Multilingual LLMs for Chinese
16 pages
2023.emnlp Main.96SynthIE
No ratings yet
2023.emnlp Main.96SynthIE
20 pages
Unlocking The Power of LLMs - Transformative Use Cases Across Industries
No ratings yet
Unlocking The Power of LLMs - Transformative Use Cases Across Industries
44 pages
Transformers
No ratings yet
Transformers
2 pages
Paper 1
No ratings yet
Paper 1
44 pages
Llama Beyond English: An Empirical Study On Language Capability Transfer
No ratings yet
Llama Beyond English: An Empirical Study On Language Capability Transfer
10 pages
Transfer Learning in Image Captioning
No ratings yet
Transfer Learning in Image Captioning
17 pages
Perspectives in Business Ethics
No ratings yet
Perspectives in Business Ethics
113 pages
2023.acl Tutorials.3
No ratings yet
2023.acl Tutorials.3
6 pages
AI-Enhanced Text Embeddings
No ratings yet
AI-Enhanced Text Embeddings
20 pages
PIRES Et Al 2023 - Sabia Portuguese Large Language Models
No ratings yet
PIRES Et Al 2023 - Sabia Portuguese Large Language Models
15 pages
Large Language Models Overview
No ratings yet
Large Language Models Overview
43 pages
1 PB
No ratings yet
1 PB
13 pages
Facebook AI's WMT20 News Translation Task Submission
No ratings yet
Facebook AI's WMT20 News Translation Task Submission
14 pages
LLM Survey
100% (1)
LLM Survey
43 pages
Titu LLM
No ratings yet
Titu LLM
19 pages
Fine-Tuning LLMs for Domain-Specific MT
No ratings yet
Fine-Tuning LLMs for Domain-Specific MT
9 pages
How Good Is Your Tokenizer? On The Monolingual Performance of Multilingual Language Models
No ratings yet
How Good Is Your Tokenizer? On The Monolingual Performance of Multilingual Language Models
28 pages
mmT5: Solving Language Hallucinations
No ratings yet
mmT5: Solving Language Hallucinations
31 pages
adaptMLLM Fine-Tuning Multilingual Language Models
No ratings yet
adaptMLLM Fine-Tuning Multilingual Language Models
24 pages
Improving Text Embeddings With Large Language Models:,, Microsoft Corporation
No ratings yet
Improving Text Embeddings With Large Language Models:,, Microsoft Corporation
20 pages
CUNI Submission For Low-Resource Languages in WMT News 2019
No ratings yet
CUNI Submission For Low-Resource Languages in WMT News 2019
7 pages
Synthetic Data Generation in Low-Resource Settings Via Fine-Tuning of Large Language Models
No ratings yet
Synthetic Data Generation in Low-Resource Settings Via Fine-Tuning of Large Language Models
12 pages
2025 Acl-Long 762
No ratings yet
2025 Acl-Long 762
16 pages
Yulan-Mini: An Open Data-Efficient Language Model
No ratings yet
Yulan-Mini: An Open Data-Efficient Language Model
47 pages
Google T5
No ratings yet
Google T5
67 pages
De La Rosa, J. (2022) BERTIN Efficient Pre-Training of A Spanish Language Model Using Perplexity Sampling
No ratings yet
De La Rosa, J. (2022) BERTIN Efficient Pre-Training of A Spanish Language Model Using Perplexity Sampling
11 pages
Advances in Neural Machine Translation
No ratings yet
Advances in Neural Machine Translation
10 pages
Transfer Learning with Text-to-Text Transformer
No ratings yet
Transfer Learning with Text-to-Text Transformer
67 pages
Introducing The Newspalm MBR and Qe Dataset: Llm-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
No ratings yet
Introducing The Newspalm MBR and Qe Dataset: Llm-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
18 pages
Deciphering The Lmpact of Pretraining Data On Large Language Models Through Machine Unlearning
No ratings yet
Deciphering The Lmpact of Pretraining Data On Large Language Models Through Machine Unlearning
20 pages
Lizhong Chen
No ratings yet
Lizhong Chen
11 pages
2023 07 28 Evolution of Language Models
No ratings yet
2023 07 28 Evolution of Language Models
73 pages
Seeds of StereotypesA Large-Scale Textual Analysis of Race and
No ratings yet
Seeds of StereotypesA Large-Scale Textual Analysis of Race and
17 pages
Agentic Systems - A Guide To Transforming Industries With Vertical AI Agents
No ratings yet
Agentic Systems - A Guide To Transforming Industries With Vertical AI Agents
31 pages
Darknet Data Mining - A Canadian Cyber-Crime
No ratings yet
Darknet Data Mining - A Canadian Cyber-Crime
13 pages
SCoFt_slf-Contrastive Fine-Tuning for Equitable Image Generation
No ratings yet
SCoFt_slf-Contrastive Fine-Tuning for Equitable Image Generation
22 pages
A Benchmark of Information Retrieval Tasks With Complex
No ratings yet
A Benchmark of Information Retrieval Tasks With Complex
25 pages
Dap - Domain - Ware Prompt Learning For Vision-And-language Navigation
No ratings yet
Dap - Domain - Ware Prompt Learning For Vision-And-language Navigation
5 pages
Dismantling Common Internet Services
No ratings yet
Dismantling Common Internet Services
4 pages
Developing Products Alert System Users Using HtmlData and
No ratings yet
Developing Products Alert System Users Using HtmlData and
9 pages
Do Language Models Care About Text Quality
No ratings yet
Do Language Models Care About Text Quality
14 pages
Building A Large Japanese Web Corpus
No ratings yet
Building A Large Japanese Web Corpus
17 pages
Generating Potent Poisons and Backdoors From
No ratings yet
Generating Potent Poisons and Backdoors From
21 pages
Image in Words
No ratings yet
Image in Words
45 pages
Retrieval Augmented Verification - Unveiling Disinformation
No ratings yet
Retrieval Augmented Verification - Unveiling Disinformation
12 pages
CLASSLA-web - Comparable Web Corpora of South Slavic2
No ratings yet
CLASSLA-web - Comparable Web Corpora of South Slavic2
12 pages
On Pretraining Data Diversity For Self-Supervised Learning
No ratings yet
On Pretraining Data Diversity For Self-Supervised Learning
16 pages
VeCAF - Vision-Language Collaborative Active Finetuning With
No ratings yet
VeCAF - Vision-Language Collaborative Active Finetuning With
13 pages
OPSD - Offensive Persian Social Media Dataset
No ratings yet
OPSD - Offensive Persian Social Media Dataset
16 pages
Compress - Align - Urating Image-Text Data With Human Knowledge2
No ratings yet
Compress - Align - Urating Image-Text Data With Human Knowledge2
13 pages
CUI24
No ratings yet
CUI24
11 pages
Timber Pricelist
No ratings yet
Timber Pricelist
2 pages
HR Policy RMG2
No ratings yet
HR Policy RMG2
7 pages
Quiz Unit 1
No ratings yet
Quiz Unit 1
3 pages
Networking in English Language Teaching
No ratings yet
Networking in English Language Teaching
2 pages
ASW Brochure 4-6yrs
No ratings yet
ASW Brochure 4-6yrs
5 pages
AEF3e L5 File9 QuickTest
No ratings yet
AEF3e L5 File9 QuickTest
2 pages
Science Prep.2 Unit One L 1
No ratings yet
Science Prep.2 Unit One L 1
8 pages
Crafting Essays on Racial Discrimination
100% (2)
Crafting Essays on Racial Discrimination
4 pages
1-12 To 4 Guardsman G Series
No ratings yet
1-12 To 4 Guardsman G Series
6 pages
Clemson Football Media Guide
No ratings yet
Clemson Football Media Guide
212 pages
Alright - I
No ratings yet
Alright - I
8 pages
Chapter 4
100% (2)
Chapter 4
50 pages
Environmental Education
No ratings yet
Environmental Education
24 pages
Kunsan To Inchon Bus Schedule
No ratings yet
Kunsan To Inchon Bus Schedule
2 pages
Mock Test 13
No ratings yet
Mock Test 13
3 pages
treatment of the 21 twenty-one paths of seven rays
No ratings yet
treatment of the 21 twenty-one paths of seven rays
23 pages
Capstone Project Question Bank
No ratings yet
Capstone Project Question Bank
11 pages
Expanded Polystyrene (EPS) in Road Construction-20 Years of Italian Experiences
No ratings yet
Expanded Polystyrene (EPS) in Road Construction-20 Years of Italian Experiences
8 pages
Junior Cert Music Set C Definitions Revision
100% (1)
Junior Cert Music Set C Definitions Revision
4 pages
Dap An - 9B - 2021
No ratings yet
Dap An - 9B - 2021
2 pages
Standard Operating Procedures: Food Storage & Leftovers
No ratings yet
Standard Operating Procedures: Food Storage & Leftovers
3 pages
Ceplattyn GT 10
No ratings yet
Ceplattyn GT 10
11 pages
NOTAMs for WIII and Nearby Airports
No ratings yet
NOTAMs for WIII and Nearby Airports
12 pages
Nature and Types of Companies 1
No ratings yet
Nature and Types of Companies 1
28 pages
Remigration or Reintegration? What Explains The Intentions of Overseas Filipino Workers
No ratings yet
Remigration or Reintegration? What Explains The Intentions of Overseas Filipino Workers
18 pages
Case Criteria
No ratings yet
Case Criteria
1 page
Numerical Investigation of Elliptical and Triangular Perforated Fins Under Forced Convection
100% (1)
Numerical Investigation of Elliptical and Triangular Perforated Fins Under Forced Convection
4 pages
ESSKA European ACL Revision Consensus
No ratings yet
ESSKA European ACL Revision Consensus
133 pages
Project Proposal Wafiuddin
No ratings yet
Project Proposal Wafiuddin
8 pages
Welder Qualification Record Overview
No ratings yet
Welder Qualification Record Overview
1 page

Do Not Worry If You Do Not Have Data

Uploaded by

Do Not Worry If You Do Not Have Data

Uploaded by

Do Not Worry if You Do Not Have Data:

Building Pretrained Language Models Using Translationese

Meet Doshi* , Raj Dabre** , and Pushpak Bhattacharyya*

Clean 76.87 77.60 79.95

Abstract English Hindi

translation for pre-training language models

Avg. across tasks

for languages other than English. Recently,

thetic data to address this data scarcity. We take

language models containing 28M and 85M pa-

exhaustive study of LM development methodolo-

(b) Results on Gujarati

IN22-Conv IN22-Gen FLORES

Headline Sentence Question Wikibio

You might also like