Module 5 NLP
This chapter introduces machine translation (MT), the use of computers to translate from one language to another.
Of course translation, in its full generality, such as the translation of literature, or
poetry, is a difficult, fascinating, and intensely human endeavor, as rich as any other
area of human creativity.
Machine translation in its present form therefore focuses on a number of very
practical tasks. Perhaps the most common current use of machine translation is
for information access. We might want to translate some instructions on the web,
perhaps the recipe for a favorite dish, or the steps for putting together some furniture.
Or we might want to read an article in a newspaper, or get information from an
online resource like Wikipedia or a government webpage in a foreign language.
MT for information access is probably one of the most common uses of NLP technology; Google Translate alone translates hundreds of billions of words a day between over 100 languages.
Another common use of machine translation is to aid human translators. MT systems are routinely used to produce a draft translation that is fixed up in a post-editing phase by a human translator. This task is often called computer-aided translation or CAT. CAT is commonly used as part of localization: the task of adapting content or a product to a particular language community.
Finally, a more recent application of MT is to in-the-moment human communication needs. This includes incremental translation, translating speech on-the-fly before the entire sentence is complete, as is commonly done in simultaneous interpretation. Image-centric translation can, for example, apply OCR to the text in a phone camera image and feed the result to an MT system to translate menus or street signs.
The standard algorithm for MT is the encoder-decoder network, also called the sequence-to-sequence network, an architecture that can be implemented with RNNs or with Transformers. We've seen in prior chapters that RNN or Transformer architectures can be used to do classification (for example to map a sentence to a positive or negative sentiment tag for sentiment analysis), or can be used to do sequence labeling (for example to assign each word in an input sentence a part-of-speech or a named entity tag). For part-of-speech tagging, recall that the output tag is associated directly with each input word, and so we can just model the tag as output y_t for each input word x_t.
Figure 11.1 Examples of other word order differences: (a) In German, adverbs that in English are more natural later in the sentence occur in initial position, and tensed verbs occur in second position. (b) In Mandarin, preposition phrases expressing goals often occur pre-verbally, unlike in English.
Fig. 11.1 shows examples of other word order differences. All of these word
order differences between languages can cause problems for translation, requiring
the system to do huge structural reorderings as it generates the output.
Figure 11.2 The complex overlap between English leg, foot, etc., and various French translations (patte, jambe, pied, étape) as discussed by Hutchins and Somers (1992).
Figure 11.3 The encoder-decoder architecture. The context is a function of the hidden representations of the input, and may be used by the decoder in a variety of ways.
p(y) = p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) ... p(y_m | y_1, ..., y_{m-1})    (11.7)

h_t = g(h_{t-1}, x_t)    (11.8)

y_t = f(h_t)    (11.9)
We only have to make one slight change to turn this language model with autoregressive generation into a translation model that can translate from a source text in one language to a target text in a second: add a sentence separation marker at the end of the source text, and then simply concatenate the target text. We briefly introduced this idea of a sentence separator token in Chapter 9 when we considered using a Transformer language model to do summarization, by training a conditional language model.
If we call the source text x and the target text y, we are computing the probability
p(y|x) as follows:
p(y|x) = p(y_1 | x) p(y_2 | y_1, x) p(y_3 | y_1, y_2, x) ... p(y_m | y_1, ..., y_{m-1}, x)    (11.10)
Fig. 11.4 shows the setup for a simplified version of the encoder-decoder model
(we’ll see the full model, which requires attention, in the next section).
Fig. 11.4 shows an English source text (“the green witch arrived”), a sentence separator token (<s>), and a Spanish target text (“llegó la bruja verde”). To translate a source text, we run it through the network performing forward inference to generate hidden states until we get to the end of the source. Then we begin autoregressive generation, asking for a word in the context of the hidden layer from the end of the source input as well as the end-of-sentence marker. Subsequent words are conditioned on the previous hidden state and the embedding for the last word generated.
2 Later we’ll see how to use pairs of Transformers as well; it’s even possible to use separate architectures
for the encoder and decoder.
Figure 11.4 Translating a single sentence (inference time) in the basic RNN version of encoder-decoder ap-
proach to machine translation. Source and target sentences are concatenated with a separator token in between,
and the decoder uses context information from the encoder’s last hidden state.
Let’s formalize and generalize this model a bit in Fig. 11.5. (To help keep things
straight, we’ll use the superscripts e and d where needed to distinguish the hidden
states of the encoder and the decoder.) The elements of the network on the left
process the input sequence x and comprise the encoder. While our simplified fig-
ure shows only a single network layer for the encoder, stacked architectures are the
norm, where the output states from the top layer of the stack are taken as the fi-
nal representation. A widely used encoder design makes use of stacked biLSTMs
where the hidden states from top layers from the forward and backward passes are
concatenated as described in Chapter 9 to provide the contextualized representations
for each time step.
Figure 11.5 A more formal version of translating a sentence at inference time in the basic RNN-based encoder-decoder architecture. The final hidden state of the encoder RNN, h^e_n, serves as the context for the decoder in its role as h^d_0 in the decoder RNN.
The context c is then passed to the decoder; the simplest way to use it is as the initial hidden state of the decoder. That is, the first decoder RNN cell uses c as its prior hidden state h^d_0. The decoder autoregressively generates a sequence of outputs, an element at a time, until an end-of-sequence marker is generated. Each hidden state is conditioned on the previous hidden state and the output generated in the previous state.
Figure 11.6 Allowing every hidden state of the decoder (not just the first decoder state) to
be influenced by the context c produced by the encoder.
One weakness of this approach as described so far is that the influence of the context vector, c, will wane as the output sequence is generated. A solution is to make the context vector c available at each step in the decoding process by adding it as a parameter to the computation of the current hidden state, using the following equation (illustrated in Fig. 11.6):

h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c)    (11.11)
Now we’re ready to see the full equations for this version of the decoder in the basic
encoder-decoder model, with context available at each decoding timestep. Recall
that g is a stand-in for some flavor of RNN and ŷt−1 is the embedding for the output
sampled from the softmax at the previous step:
c = h^e_n
h^d_0 = c
h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c)
z_t = f(h^d_t)
y_t = softmax(z_t)    (11.12)
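To make these equations concrete, here is a minimal NumPy sketch of a single decoding step with the context vector c available at every timestep. The tanh recurrence standing in for g, the toy dimensions, and the parameter names are illustrative assumptions rather than the chapter's reference implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev_embedding, h_prev, c, params):
    """One context-augmented decoder step:
    h_t = g(y_prev, h_prev, c), z_t = f(h_t), y_t = softmax(z_t)."""
    W_in, W_out, b_in, b_out = params
    inp = np.concatenate([y_prev_embedding, h_prev, c])
    h_t = np.tanh(W_in @ inp + b_in)         # g: a simple RNN cell
    z_t = W_out @ h_t + b_out                # f: project to vocabulary logits
    return h_t, softmax(z_t)                 # new hidden state, distribution over V

# Toy dimensions: hidden size 4, embedding size 3, vocabulary size 5
rng = np.random.default_rng(0)
d_h, d_e, V = 4, 3, 5
params = (rng.normal(size=(d_h, d_e + 2 * d_h)), rng.normal(size=(V, d_h)),
          np.zeros(d_h), np.zeros(V))
c = rng.normal(size=d_h)                     # context = final encoder hidden state
h, y = decoder_step(rng.normal(size=d_e), c, c, params)   # h_0^d = c
print(int(y.argmax()))                       # greedy choice of the next token
```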
Finally, as shown earlier, the output y at each time step consists of a softmax computation over the set of possible outputs (the vocabulary, in the case of language modeling or MT). We compute the most likely output at each time step by taking the argmax over the softmax output:

ŷ_t = argmax_{w ∈ V} P(w | x, y_1, ..., y_{t-1})    (11.13)
There are also various ways to make the model a bit more powerful. For example, we can help the model keep track of what has already been generated and what hasn't by conditioning the output layer y not solely on the hidden state h^d_t and the context c, but also on the output y_{t-1} generated at the previous timestep:

y_t = softmax(ŷ_{t-1}, z_t, c)    (11.14)
Each training example is a tuple of paired strings, a source and a target. Concatenated with a separator token, these source-target pairs can now serve as training data.
For MT, the training data typically consists of sets of sentences and their transla-
tions. These can be drawn from standard datasets of aligned sentence pairs, as we’ll
discuss in Section 11.7.2. Once we have a training set, the training itself proceeds
as with any RNN-based language model. The network is given the source text and then, starting with the separator token, is trained autoregressively to predict the next word, as shown in Fig. 11.7.
Figure 11.7 Training the basic RNN encoder-decoder approach to machine translation. Note that in the
decoder we usually don’t propagate the model’s softmax outputs ŷt , but use teacher forcing to force each input
to the correct gold value for training. We compute the softmax output distribution over ŷ in the decoder in order
to compute the loss at each token, which can then be averaged to compute a loss for the sentence.
Note the differences between training (Fig. 11.7) and inference (Fig. 11.4) with respect to the outputs at each time step. The decoder during inference uses its own estimated output ŷ_t as the input for the next time step x_{t+1}. Thus the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens. In training, therefore, it is more common to use teacher forcing in the decoder. Teacher forcing means that we force the system to use the gold target token from training as the next input x_{t+1}, rather than allowing it to rely on the (possibly erroneous) decoder output ŷ_t. This speeds up training.
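As an illustration of this training/inference difference, here is a hedged sketch of both loops, reusing the decoder_step sketch above; the embed helper and the assumption that token 0 is the separator are placeholders, not details from the text:

```python
import numpy as np

SEP = 0                                        # assume token id 0 is the separator <s>

def train_step_teacher_forcing(decoder_step, embed, params, c, gold_tokens):
    """Training: the decoder input at step t+1 is always the GOLD token y_t."""
    h, total_nll, prev_tok = c, 0.0, SEP       # h_0^d = c (the encoder context)
    for gold in gold_tokens:
        h, y_dist = decoder_step(embed(prev_tok), h, c, params)
        total_nll += -np.log(y_dist[gold])     # cross-entropy loss at this token
        prev_tok = gold                        # teacher forcing: feed the gold token
    return total_nll / len(gold_tokens)        # average loss for the sentence

def generate_greedy(decoder_step, embed, params, c, eos, max_len=50):
    """Inference: the decoder input at step t+1 is its OWN previous prediction."""
    h, prev_tok, out = c, SEP, []
    for _ in range(max_len):
        h, y_dist = decoder_step(embed(prev_tok), h, c, params)
        prev_tok = int(np.argmax(y_dist))      # feed back the model's own guess
        if prev_tok == eos:
            break
        out.append(prev_tok)
    return out
```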
11.4 Attention
The simplicity of the encoder-decoder model is its clean separation of the encoder
— which builds a representation of the source text — from the decoder, which uses
this context to generate a target text. In the model as we’ve described it so far, this
context vector is hn , the hidden state of the last (nth) time step of the source text.
This final hidden state is thus acting as a bottleneck: it must represent absolutely
everything about the meaning of the source text, since the only thing the decoder
knows about the source text is what’s in this context vector. Information at the
beginning of the sentence, especially for long sentences, may not be equally well
represented in the context vector.
Figure 11.8 Requiring the context c to be only the encoder’s final hidden state forces all the
information from the entire source sentence to pass through this representational bottleneck.
Figure 11.9 The attention mechanism allows each hidden state of the decoder to see a
different, dynamic, context, which is a function of all the encoder hidden states.
The first step in computing c_i is to compute how much to focus on each encoder state, how relevant each encoder state is to the decoder state captured in h^d_{i-1}. We capture relevance by computing, at each state i during decoding, a score(h^d_{i-1}, h^e_j) for each encoder state j.
The simplest such score, called dot-product attention, implements relevance as similarity: measuring how similar the decoder hidden state is to an encoder hidden state, by computing the dot product between them:

score(h^d_{i-1}, h^e_j) = h^d_{i-1} · h^e_j    (11.15)
The score that results from this dot product is a scalar that reflects the degree of similarity between the two vectors. The vector of these scores across all the encoder hidden states gives us the relevance of each encoder state to the current step of the decoder.
To make use of these scores, we'll normalize them with a softmax to create a vector of weights, α_{ij}, that tells us the proportional relevance of each encoder hidden state j to the prior hidden decoder state, h^d_{i-1}:
α_{ij} = exp(score(h^d_{i-1}, h^e_j)) / Σ_k exp(score(h^d_{i-1}, h^e_k))    (11.17)
Finally, given the distribution in α, we can compute a fixed-length context vector for
the current decoder state by taking a weighted average over all the encoder hidden
states.
c_i = Σ_j α_{ij} h^e_j    (11.18)
With this, we finally have a fixed-length context vector that takes into account
information from the entire encoder state that is dynamically updated to reflect the
needs of the decoder at each step of decoding. Fig. 11.10 illustrates an encoder-
decoder network with attention, focusing on the computation of one context vector
ci .
Figure 11.10 A sketch of the encoder-decoder network with attention, focusing on the computation of c_i. The context value c_i is one of the inputs to the computation of h^d_i. It is computed by taking the weighted sum of all the encoder hidden states, each weighted by their dot product with the prior decoder hidden state h^d_{i-1}.
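As a concrete illustration of Eqs. 11.17 and 11.18, here is a small NumPy sketch of dot-product attention that computes the weights α_{ij} and the context vector c_i; the toy dimensions and random states are placeholders, not values from the text:

```python
import numpy as np

def attention_context(h_dec_prev, enc_states):
    """Dot-product attention: score each encoder state against the previous
    decoder state, softmax the scores, and return the weighted sum (Eq. 11.18)."""
    scores = enc_states @ h_dec_prev                 # score(h^d_{i-1}, h^e_j) for all j
    scores = scores - scores.max()                   # numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()   # Eq. 11.17
    c_i = alphas @ enc_states                        # weighted sum of encoder states
    return c_i, alphas

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(6, 4))                 # 6 encoder time steps, hidden size 4
h_dec_prev = rng.normal(size=4)                      # previous decoder hidden state
c_i, alphas = attention_context(h_dec_prev, enc_states)
print(alphas.round(2), c_i.shape)                    # weights sum to 1; c_i has size 4
```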
It's also possible to create more sophisticated scoring functions for attention models. Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, W_s:

score(h^d_{i-1}, h^e_j) = h^d_{i-1} W_s h^e_j    (11.19)
The weights W_s, which are then trained during normal end-to-end training, give the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application. This bilinear model also allows the encoder and decoder to use different dimensional vectors, whereas simple dot-product attention requires that the encoder and decoder hidden states have the same dimensionality.
11.5 Beam Search

Choosing the single most probable token to generate at each step is called greedy decoding; a greedy algorithm is one that makes a choice that is locally optimal, whether or not it will turn out to have been the best choice with hindsight.
Indeed, greedy search is not optimal, and may not find the highest probability
translation. The problem is that the token that looks good to the decoder now might
turn out later to have been the wrong choice!
Let's see this by looking at the search tree, a graphical representation of the
choices the decoder makes in searching for the best translation, in which we view
the decoding problem as a heuristic state-space search and systematically explore
the space of possible outputs. In such a search tree, the branches are the actions, in
this case the action of generating a token, and the nodes are the states, in this case
the state of having generated a particular prefix. We are searching for the best action
sequence, i.e. the target string with the highest probability. Fig. 11.11 demonstrates
the problem, using a made-up example. Notice that the most probable sequence is ok ok </s> (with a probability of .4*.7*1.0), but a greedy search algorithm will fail to find it, because it incorrectly chooses yes as the first word since it has the highest local probability.
Recall from Chapter 8 that for part-of-speech tagging we used dynamic pro-
gramming search (the Viterbi algorithm) to address this problem. Unfortunately,
dynamic programming is not applicable to generation problems with long-distance
dependencies between the output decisions. The only method guaranteed to find the
best solution is exhaustive search: computing the probability of every one of the V^T possible sentences (for some length value T), which is obviously too slow.
Instead, decoding in MT and other sequence generation problems generally uses a method called beam search. In beam search, instead of choosing the best token to generate at each timestep, we keep k possible tokens at each step. This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
Thus at the first step of decoding, we compute a softmax over the entire vocabulary, assigning a probability to each word. We then select the k best options from this softmax output. These initial k outputs are the search frontier, and these k initial words are called hypotheses.
Figure 11.11 A search tree for generating the target string T = t_1, t_2, ... from the vocabulary V = {yes, ok, </s>}, given the source string, showing the probability of generating each token from that state. Greedy search would choose yes at the first time step followed by yes, instead of the globally most probable sequence ok ok.
Thus at each step, to compute the probability of a partial translation, we simply add
the log probability of the prefix translation so far to the log probability of generating
the next token. Fig. 11.13 shows the scoring for the example sentence shown in
Fig. 11.12, using some simple made-up probabilities. Log probabilities are negative
or 0, and the max of two log probabilities is the one that is greater (closer to 0).
Fig. 11.14 gives the algorithm.
One problem arises from the fact that the completed hypotheses may have differ-
ent lengths. Because models generally assign lower probabilities to longer strings,
a naive algorithm would also choose shorter strings for y. This was not an issue
during the earlier steps of decoding; due to the breadth-first nature of beam search
Figure 11.12 Beam search decoding with a beam width of k = 2. At each time step, we choose the k best
hypotheses, compute the V possible extensions of each hypothesis, score the resulting k ∗V possible hypotheses
and choose the best k to continue. At time 1, the frontier is filled with the best 2 options from the initial state
of the decoder: arrived and the. We then extend each of those, compute the probability of all the hypotheses so
far (arrived the, arrived aardvark, the green, the witch) and compute the best 2 (in this case the green and the
witch) to be the search frontier to extend on the next step. On the arcs we show the decoders that we run to score
the extension words (although for simplicity we haven’t shown the context value ci that is input at each step).
all the hypotheses being compared had the same length. The usual solution to this is
to apply some form of length normalization to each of the hypotheses, for example
simply dividing the negative log probability by the number of words:
score(y) = − log P(y|x) = − (1/T) Σ_{i=1}^{T} log P(y_i | y_1, ..., y_{i-1}, x)    (11.21)
Figure 11.13 Scoring for beam search decoding with a beam width of k = 2. We maintain the log probability
of each hypothesis in the beam by incrementally adding the logprob of generating each next token. Only the top
k paths are extended to the next step.
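To make the bookkeeping in Fig. 11.13 concrete, here is a hedged Python sketch of beam search with the length normalization of Eq. 11.21. The step_logprobs function (returning log probabilities over the vocabulary given a prefix) and the token indices are assumptions for illustration, not part of the text:

```python
import numpy as np

def beam_search(step_logprobs, bos, eos, k=2, max_len=20):
    """step_logprobs(prefix) -> np.array of log P(next token | source, prefix).
    Returns the completed hypothesis with the best length-normalized score."""
    frontier = [([bos], 0.0)]                 # (prefix, summed log probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in frontier:
            logprobs = step_logprobs(prefix)
            for tok, lp in enumerate(logprobs):
                candidates.append((prefix + [tok], logp + lp))
        # keep the k best extensions; completed hypotheses leave the beam
        candidates.sort(key=lambda c: c[1], reverse=True)
        frontier = []
        for prefix, logp in candidates:
            if prefix[-1] == eos:
                completed.append((prefix, logp))
            else:
                frontier.append((prefix, logp))
            if len(frontier) == k:
                break
        if not frontier:
            break
    # length-normalize: divide the log probability by the number of tokens
    return max(completed or frontier, key=lambda c: c[1] / len(c[0]))
```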
11.7.1 Tokenization
Machine translation systems generally use a fixed vocabulary. A common way to generate this vocabulary is with the BPE or wordpiece algorithms sketched in Chapter 2. Generally a shared vocabulary is used for the source and target languages, which makes it easy to copy tokens (like names) from source to target, so we build the wordpiece/BPE lexicon on a corpus that contains both source and target language data. Wordpieces use a special symbol at the beginning of each token; here's a resulting tokenization from the Google MT system (Wu et al., 2016):
words: Jet makers feud over seat width with big orders at stake
wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
We gave the BPE algorithm in detail in Chapter 2; here are more details on the wordpiece algorithm, which is given a training corpus and a desired vocabulary size V, and proceeds as follows:
1. Initialize the wordpiece lexicon with characters (for example a subset of Uni-
code characters, collapsing all the remaining characters to a special unknown
character token).
2. Repeat until there are V wordpieces:
(a) Train an n-gram language model on the training corpus, using the current
set of wordpieces.
(b) Consider the set of possible new wordpieces made by concatenating two
wordpieces from the current lexicon. Choose the one new wordpiece that
most increases the language model probability of the training corpus.
A vocabulary of 8K to 32K word pieces is commonly used.
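Here is a hedged sketch of that greedy loop. The corpus_loglik helper, which scores the training corpus with an n-gram language model over a candidate wordpiece vocabulary, is hypothetical and left abstract, since that is the expensive, implementation-specific part:

```python
def train_wordpieces(corpus_chars, corpus_loglik, target_size):
    """Greedy wordpiece training: start from single characters and repeatedly add
    the concatenation of two existing pieces that most improves corpus likelihood."""
    lexicon = set(corpus_chars)                      # step 1: character vocabulary
    while len(lexicon) < target_size:                # step 2: grow to V pieces
        base = corpus_loglik(lexicon)                # LM log-likelihood of the corpus
        best_piece, best_gain = None, float("-inf")
        for a in lexicon:
            for b in lexicon:
                candidate = a + b                    # possible new wordpiece
                if candidate in lexicon:
                    continue
                gain = corpus_loglik(lexicon | {candidate}) - base
                if gain > best_gain:
                    best_piece, best_gain = candidate, gain
        if best_piece is None:
            break                                    # no new piece can be formed
        lexicon.add(best_piece)
    return lexicon
```

Real implementations use aggressive heuristics to avoid rescoring the corpus for every candidate pair; the sketch just mirrors the two steps described above.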
11.7.2 MT corpora
parallel corpus Machine translation models are trained on a parallel corpus, sometimes called a
bitext, a text that appears in two (or more) languages. Large numbers of paral-
Europarl lel corpora are available. Some are governmental; the Europarl corpus (Koehn,
2005), extracted from the proceedings of the European Parliament, contains between
400,000 and 2 million sentences each from 21 European languages. The United Na-
tions Parallel Corpus contains on the order of 10 million sentences in the six official
languages of the United Nations (Arabic, Chinese, English, French, Russian, Span-
ish) Ziemski et al. (2016). Other parallel corpora have been made from movie and
TV subtitles, like the OpenSubtitles corpus (Lison and Tiedemann, 2016), or from
general web text, like the ParaCrawl corpus of with 223 million sentence pairs be-
tween 23 EU languages and English extracted from the CommonCrawl Bañón et al.
(2020).
Sentence alignment
Standard training corpora for MT come as aligned pairs of sentences. When creating
new corpora, for example for underresourced languages or new domains, these sen-
tence alignments must be created. Fig. 11.15 gives a sample hypothetical sentence
alignment.
E1: “Good morning," said the little prince. | F1: -Bonjour, dit le petit prince.
E2: “Good morning," said the merchant. | F2: -Bonjour, dit le marchand de pilules perfectionnées qui apaisent la soif.
E3: This was a merchant who sold pills that had been perfected to quench thirst. | F3: On en avale une par semaine et l'on n'éprouve plus le besoin de boire.
E4: You just swallow one pill a week and you won't feel the need for anything to drink. | F4: -C'est une grosse économie de temps, dit le marchand.
E5: “They save a huge amount of time," said the merchant. | F5: Les experts ont fait des calculs.
E6: “Fifty-three minutes a week." | F6: On épargne cinquante-trois minutes par semaine.
E7: “If I had fifty-three minutes to spend?" said the little prince to himself. | F7: “Moi, se dit le petit prince, si j'avais cinquante-trois minutes à dépenser, je marcherais tout doucement vers une fontaine..."
E8: “I would take a stroll to a spring of fresh water."
Figure 11.15 A sample alignment between sentences in English and French, with sentences extracted from Antoine de Saint-Exupéry's Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e_1, ..., e_n and f_1, ..., f_n and finds minimal sets of sentences that are translations of each other, including single sentence mappings like (e_1,f_1), (e_4,f_3), (e_5,f_4), (e_6,f_6) as well as 2-1 alignments (e_2/e_3,f_2), (e_7/e_8,f_7), and null alignments (f_5).
Given two documents that are translations of each other, we generally need two
steps to produce sentence alignments:
• a cost function that takes a span of source sentences and a span of target sen-
tences and returns a score measuring how likely these spans are to be transla-
tions.
• an alignment algorithm that takes these scores to find a good alignment be-
tween the documents.
Since it is possible to induce multilingual sentence embeddings (Artetxe and Schwenk, 2019), cosine similarity of such embeddings provides a natural scoring function (Schwenk, 2018). Thompson and Koehn (2019) give the following cost function between two sentences or spans x, y from the source and target documents respectively:

c(x, y) = ( (1 − cos(x, y)) nSents(x) nSents(y) ) / ( Σ_{s=1}^{S} (1 − cos(x, y_s)) + Σ_{s=1}^{S} (1 − cos(x_s, y)) )

where nSents() gives the number of sentences (this biases the metric toward many alignments of single sentences instead of aligning very large spans). The denominator helps to normalize the similarities, and so x_1, ..., x_S, y_1, ..., y_S are randomly selected sentences sampled from the respective documents.
Usually dynamic programming is used as the alignment algorithm (Gale and Church, 1993), in a simple extension of the minimum edit distance algorithm we introduced in Chapter 2.
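As an illustration of the embedding-based scoring idea (not of the exact Thompson and Koehn cost), here is a small NumPy sketch that scores candidate sentence pairs by cosine similarity of multilingual sentence embeddings; the embed function standing in for a multilingual encoder is a placeholder assumption:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_pairs(src_sents, tgt_sents, embed):
    """Score every source/target sentence pair by embedding cosine similarity.
    An alignment algorithm (e.g. dynamic programming) would search over these scores."""
    src_vecs = [embed(s) for s in src_sents]
    tgt_vecs = [embed(t) for t in tgt_sents]
    return np.array([[cosine(sv, tv) for tv in tgt_vecs] for sv in src_vecs])

# Example usage with a stand-in embedder (a real system would use a multilingual
# sentence encoder such as LASER or LaBSE):
# sims = score_pairs(english_sents, french_sents, embed=my_multilingual_encoder)
# best_match_for_each_source = sims.argmax(axis=1)
```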
Finally, it’s helpful to do some corpus cleanup by removing noisy sentence pairs.
This can involve handwritten rules to remove low-precision pairs (for example re-
moving sentences that are too long, too short, have different URLs, or even pairs
that are too similar, suggesting that they were copies rather than translations). Or
pairs can be ranked by their multilingual embedding cosine score and low-scoring
pairs discarded.
11.7.3 Backtranslation
We’re often short of data for training MT models, since parallel corpora may be
limited for particular languages or domains. However, often we can find a large
monolingual corpus, to add to the smaller parallel corpora that are available.
Backtranslation is a way of making use of monolingual corpora in the target
language by creating synthetic bitexts. In backtranslation, we train an intermediate
target-to-source MT system on the small bitext to translate the monolingual target
data to the source language. Now we can add this synthetic bitext (natural target
sentences, aligned with MT-produced source sentences) to our training data, and
retrain our source-to-target MT model. For example suppose we want to translate
from Navajo to English but only have a small Navajo-English bitext, although of
course we can find lots of monolingual English data. We use the small bitext to build
an MT engine going the other way (from English to Navajo). Once we translate the
monolingual English text to Navajo, we can add this synthetic Navajo/English bitext
to our training data.
Backtranslation has various parameters. One is how we generate the backtranslated data; we can run the decoder in greedy inference, or use beam search. Or we can do sampling, or Monte Carlo search. In Monte Carlo decoding, at each timestep, instead of always generating the word with the highest softmax probability, we roll a weighted die, and use it to choose the next word according to its softmax probability. This works just like the sampling algorithm we saw in Chapter 3 for generating random sentences from n-gram language models. Imagine there are only 4 words and the softmax probability distribution at time t is (the: 0.6, green: 0.2, a: 0.1, witch: 0.1). We roll a weighted die, with the 4 sides weighted 0.6, 0.2, 0.1, and 0.1, and choose the word based on which side comes up. Another parameter
is the ratio of backtranslated data to natural bitext data; we can choose to upsample
the bitext data (include multiple copies of each sentence).
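The weighted-die step corresponds directly to sampling from the softmax distribution. Here is a minimal sketch using the toy distribution from the paragraph above (the probabilities are the made-up ones in the text; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng()
vocab = ["the", "green", "a", "witch"]
probs = [0.6, 0.2, 0.1, 0.1]                # softmax output at time t (made-up example)

# Greedy decoding always picks the argmax ("the"); Monte Carlo decoding samples:
greedy_word = vocab[int(np.argmax(probs))]
sampled_word = rng.choice(vocab, p=probs)   # "the" ~60% of the time, "witch" ~10%
print(greedy_word, sampled_word)
```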
In general backtranslation works surprisingly well; one estimate suggests that a
system trained on backtranslated text gets about 2/3 of the gain as would training on
the same amount of natural bitext (Edunov et al., 2018).
11.8 MT Evaluation
Translations can be evaluated along two dimensions, adequacy and fluency.
adequacy: how well the translation captures the exact meaning of the source sentence. Sometimes called faithfulness or fidelity.
fluency: how fluent the translation is in the target language (is it grammatical, clear, readable, natural).
Both human and automatic evaluation metrics are used.
For example, along the dimension of fluency, we can ask how intelligible, how clear, how readable, or how natural the MT output (the target text) is. We can give the raters a scale, for example from 1 (totally unintelligible) to 5 (totally intelligible), or 1 to 100, and ask them to rate each sentence or paragraph of the MT output.
We can do the same thing to judge the second dimension, adequacy, using raters
to assign scores on a scale. If we have bilingual raters, we can give them the source
sentence and a proposed target sentence, and rate, on a 5-point or 100-point scale,
how much of the information in the source was preserved in the target. If we only
have monolingual raters but we have a good human translation of the source text,
we can give the monolingual raters the human reference translation and a target
machine translation and again rate how much information is preserved. If we use a fine-grained enough scale, we can normalize raters by subtracting the mean from their scores and dividing by the standard deviation.
An alternative is to do ranking: give the raters a pair of candidate translations, and ask them which one they prefer.
While humans produce the best evaluations of machine translation output, run-
ning a human evaluation can be time consuming and expensive. In the next section
we introduce an automatic metric that, while less accurate than human evaluation, is
widely used because it can quickly evaluate potential system improvements, or even
be used as an automatic loss function for training.
Source
la verdad, cuya madre es la historia, émula del tiempo, depósito de las acciones,
testigo de lo pasado, ejemplo y aviso de lo presente, advertencia de lo por venir.
Reference
truth, whose mother is history, rival of time, storehouse of deeds,
witness for the past, example and counsel for the present, and warning for the future.
Candidate 1
truth, whose mother is history, voice of time, deposit of actions,
witness for the past, example and warning for the present, and warning for the future
Candidate 2
the truth, which mother is the history, émula of the time, deposition of the shares,
witness of the past, example and notice of the present, warning of it for coming
Figure 11.16 Intuition for BLEU: One of two candidate translations of a Spanish sentence
shares more n-grams, and especially longer n-grams, with the reference human translation.
BLEU combines these four n-gram precisions by taking their geometric mean.
In addition, BLEU penalizes candidate translations that are too short. Imagine
our machine translation engine returned the following terrible candidate translation
3 for the example in Fig. 11.16:
(11.24) for the
Because the words for and the and the bigram for the all appear in the human ref-
erence, n-gram precision alone will assign candidate 3 a great score, since it has
perfect unigram and bigram precisions of 1.0!
One option for dealing with this problem is to combine recall with precision,
but BLEU chooses another option: adding a brevity penalty over the whole corpus,
penalizing a system that produces translations that are on average shorter than the
reference translations. Let sys len be the sum of the length of all the candidate trans-
lation sentences, and ref len be the sum of the length of all the reference translation
sentences. If the candidate translations are shorter than the reference, we assign a
brevity penalty BP that is a function of their ratio:
BP = min( 1, exp( 1 − ref_len / sys_len ) )

BLEU = BP × ( ∏_{n=1}^{4} prec_n )^{1/4}    (11.25)
BLEU also works fine if we have multiple human reference translations for a source sentence. In fact, BLEU works better in this situation, since a source sentence can be legitimately translated in many ways and n-gram precision will hence be more robust. We just match an n-gram if it occurs in any of the references. And for the brevity penalty, we choose for each candidate sentence the reference sentence that is the closest in length to compute the ref_len. But in practice most translation corpora only have a single human translation to compare against.
Finally, implementing BLEU requires standardizing on many details of smooth-
ing and tokenization; for this reason it is recommended to use standard implemen-
tations like SACREBLEU (Post, 2018) rather than trying to implement BLEU from
scratch.
To get a confidence interval on a single BLEU score using the bootstrap test,
recall from Section 4.9 that we take our test set (or devset) and create thousands of
pseudo-testsets by repeatedly sampling with replacement from the original test set.
We now compute the BLEU score of each of the pseudo-testsets. If we drop the
top 2.5% and bottom 2.5% of the scores, the remaining scores will give us the 95%
confidence interval for the BLEU score of our system.
To compare two MT systems A and B, we draw the same set of pseudo-testsets,
and compute the BLEU scores for each of them. We then compute the percentage
of pseudo-test-sets in which A has a higher BLEU score than B.
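A hedged sketch of this bootstrap procedure, assuming a corpus_bleu(candidates, references) function like the one sketched above and a paired test set; the number of resamples is an illustrative choice:

```python
import numpy as np

def bootstrap_bleu(candidates, references, corpus_bleu, n_resamples=1000, seed=0):
    """95% confidence interval for BLEU via bootstrap resampling of the test set."""
    rng = np.random.default_rng(seed)
    n = len(candidates)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # sample sentences with replacement
        scores.append(corpus_bleu([candidates[i] for i in idx],
                                  [references[i] for i in idx]))
    return np.percentile(scores, 2.5), np.percentile(scores, 97.5)

def bootstrap_compare(cands_a, cands_b, references, corpus_bleu,
                      n_resamples=1000, seed=0):
    """Fraction of pseudo-test-sets on which system A outscores system B."""
    rng = np.random.default_rng(seed)
    n, wins = len(references), 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        refs = [references[i] for i in idx]
        a = corpus_bleu([cands_a[i] for i in idx], refs)
        b = corpus_bleu([cands_b[i] for i in idx], refs)
        wins += a > b
    return wins / n_resamples
```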
BLEU: Limitations
While automatic metrics like BLEU are useful, they have important limitations.
BLEU is very local: a large phrase that is moved around might not change the
BLEU score at all, and BLEU can’t evaluate cross-sentence properties of a docu-
ment like its discourse coherence (Chapter 22). BLEU and similar automatic met-
rics also do poorly at comparing very different kinds of systems, such as comparing
human-aided translation against machine translation, or different machine transla-
tion architectures against each other (Callison-Burch et al., 2006). Such automatic
metrics are probably most appropriate when evaluating changes to a single system.
R_BERT = (1/|x|) Σ_{x_i ∈ x} max_{x̃_j ∈ x̃} x_i · x̃_j        P_BERT = (1/|x̃|) Σ_{x̃_j ∈ x̃} max_{x_i ∈ x} x_i · x̃_j    (11.27)
Figure 11.18 The computation of BERTSCORE recall from reference x and candidate x̂, from Figure 1 in Zhang et al. (2020). This version shows an extended version of the metric in which tokens are also weighted by their idf values.
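A minimal sketch of Eq. 11.27, assuming we already have contextual token embeddings for the reference and candidate (for instance from a BERT-style encoder) as row-wise unit-normalized matrices; the idf weighting shown in the figure is omitted:

```python
import numpy as np

def bertscore(ref_emb, cand_emb):
    """ref_emb: (len_ref, d), cand_emb: (len_cand, d), rows unit-normalized.
    Returns (recall, precision, F1) following Eq. 11.27."""
    sim = ref_emb @ cand_emb.T                 # pairwise cosine similarities
    recall = sim.max(axis=1).mean()            # each reference token greedily matched
    precision = sim.max(axis=0).mean()         # each candidate token greedily matched
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Toy usage with random unit vectors standing in for contextual embeddings:
rng = np.random.default_rng(0)
ref = rng.normal(size=(5, 8)); ref /= np.linalg.norm(ref, axis=1, keepdims=True)
cand = rng.normal(size=(6, 8)); cand /= np.linalg.norm(cand, axis=1, keepdims=True)
print(bertscore(ref, cand))
```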
11.9 Bias and Ethical Issues

Machine translation raises many of the same ethical issues that we've discussed in earlier chapters. For example, consider MT systems translating from Hungarian (which has the gender neutral pronoun ő) or Spanish (which often drops pronouns) into English (in which pronouns are obligatory, and they have grammatical gender).
When translating a reference to a person described without specified gender, MT systems often default to male gender (Schiebinger 2014, Prates et al. 2019). And MT systems often assign gender according to culture stereotypes of the sort we saw in Section 6.11. Fig. 11.19 shows examples from Prates et al. (2019), in which Hungarian gender-neutral ő is a nurse is translated with she, but gender-neutral ő is a CEO is translated with he. Prates et al. (2019) find that these stereotypes can't completely be accounted for by gender bias in US labor statistics, because the biases are amplified by MT systems, with pronouns being mapped to male or female gender with a probability higher than if the mapping was based on actual labor employment statistics.
Many ethical questions in MT require further research. One is the need for MT systems to assign confidence values to candidate translations, so they can abstain from giving incorrect translations that may cause harm.
Another is the need for low-resource algorithms that can do translation to and
from the vast majority of the world’s languages, which do not have large parallel
texts available for training. This problem is exacerbated by the fact that cross-lingual
transfer and multilingual approaches to MT tend to focus on the case where one
of the languages is English (Anastasopoulos and Neubig, 2020). ∀ et al. (2020)
propose a participatory design process to encourage content creators, curators, and
language technologists who speak these low-resourced languages to participate in
development of MT algorithms. Their method uses online groups, mentoring, and
online infrastructure, and they report on a case study on developing MT algorithms
for low-resource African languages.
11.10 Summary
Machine translation is one of the most widely used applications of NLP, and the encoder-decoder model, first developed for MT, is a key tool that has applications throughout NLP.
• Languages have divergences, both structural and lexical, that make translation
difficult.
• The linguistic field of typology investigates some of these differences; lan-
guages can be classified by their position along typological dimensions like
whether verbs precede their objects.
• Encoder-decoder networks are composed of an encoder network that takes
an input sequence and creates a contextualized representation of it, the con-
text. This context representation is then passed to a decoder which generates
a task-specific output sequence.
• The attention mechanism enriches the context vector, allowing the decoder to view information from all the hidden states of the encoder, not just the last hidden state.
• The encoder-decoder architecture can be implemented by RNNs or by Trans-
formers.
• For the decoder, choosing the single most probable token to generate at each
step is called greedy decoding.
• In beam search, instead of choosing the best token to generate at each timestep,
we keep k possible tokens at each step. This fixed-size memory footprint k is
called the beam width.
• Machine translation models are trained on a parallel corpus, sometimes called
a bitext, a text that appears in two (or more) languages.
• Backtranslation is a way of making use of monolingual corpora in the target
language by running a pilot MT engine backwards to create synthetic bitexts.
• MT is evaluated by measuring a translation’s adequacy (how well it captures
the meaning of the source sentence) and fluency (how fluent or natural it is
in the target language). Human evaluation is the gold standard, but automatic
evaluation metrics like BLEU, which measure word or n-gram overlap with
human translations, or more recent metrics based on embedding similarity, are
also commonly used.
Figure 11.20 The Vauquois (1968) triangle: the source text is analyzed into a semantic/syntactic structure, transfer maps it to a target-language structure, and generation produces the target text; an interlingua sits at the apex and direct translation at the base.
The algorithms (except for the decoder) were published in full detail— encouraged
by the US government which had partially funded the work— which gave them a
huge impact on the research community (Brown et al. 1990, Brown et al. 1993).
By the turn of the century, most academic research on machine translation used the
statistical noisy channel model. Progress was made hugely easier by the develop-
ment of publicly available toolkits, like the GIZA toolkit (Och and Ney, 2003) which
implements IBM models 1–5 as well as the HMM alignment model.
Around the turn of the century, an extended approach, called phrase-based translation, was developed, which was based on inducing translations for phrase pairs (Och 1998, Marcu and Wong 2002, Koehn et al. 2003, Och and Ney 2004, Deng and Byrne 2005, inter alia). A log linear formulation (Och and Ney, 2004) was trained to directly optimize evaluation metrics like BLEU in a method known as Minimum Error Rate Training, or MERT (Och, 2003), also drawing from speech recognition models (Chou et al., 1993). Popular toolkits were developed like Moses (Koehn et al. 2006, Zens and Ney 2007).
There were also approaches around the turn of the century that were based on syntactic structure (Chapter 12). Models based on transduction grammars (also called synchronous grammars) assign a parallel syntactic tree structure to a pair of sentences in different languages, with the goal of translating the sentences by applying reordering operations on the trees. From a generative perspective, we can view a transduction grammar as generating pairs of aligned sentences in two languages. Some of the most widely used models included the inversion transduction grammar (Wu, 1996) and synchronous context-free grammars (Chiang, 2005).
The modern encoder-decoder approach to MT was developed by Kalchbrenner and Blunsom (2013), Cho et al. (2014), and Sutskever et al. (2014), among others.
Beam search has an interesting relationship with human language processing; Meister et al. (2020) show that beam search enforces the cognitive property of uniform information density in text. Uniform information density is the hypothesis that human language processors tend to prefer to distribute information equally across the sentence (Jaeger and Levy, 2007).
Research on evaluation of machine translation began quite early. Miller and
Beebe-Center (1958) proposed a number of methods drawing on work in psycholin-
guistics. These included the use of cloze and Shannon tasks to measure intelligibil-
ity as well as a metric of edit distance from a human translation, the intuition that
underlies all modern automatic evaluation metrics like BLEU. The ALPAC report
included an early evaluation study conducted by John Carroll that was extremely in-
fluential (Pierce et al., 1966, Appendix 10). Carroll proposed distinct measures for
fidelity and intelligibility, and had raters score them subjectively on 9-point scales.
More recent work on evaluation has focused on coming up with automatic metrics, including the work on BLEU discussed in Section 11.8.2 (Papineni et al., 2002), as well as related measures like NIST (Doddington, 2002), TER (Translation Error Rate) (Snover et al., 2006), precision and recall (Turian et al., 2003), and METEOR (Banerjee and Lavie, 2005).
Good surveys of the early history of MT are Hutchins (1986) and Hutchins (1997). Nirenburg et al. (2002) is a collection of early readings in MT.
See Croft (1990) or Comrie (1989) for introductions to typology.
Exercises