CHAPTER 10
Machine Translation and Encoder-Decoder Models
This chapter introduces machine translation (MT), the use of computers to translate from one language to another.
Of course translation, in its full generality, such as the translation of literature, or
poetry, is a difficult, fascinating, and intensely human endeavor, as rich as any other
area of human creativity.
Machine translation in its present form therefore focuses on a number of very
practical tasks. Perhaps the most common current use of machine translation is
for information access. We might want to translate some instructions on the web,
perhaps the recipe for a favorite dish, or the steps for putting together some furniture.
Or we might want to read an article in a newspaper, or get information from an
online resource like Wikipedia or a government webpage in a foreign language.
MT for information access is probably one of the most common uses of NLP technology; Google Translate alone translates hundreds of billions of words a day between over 100 languages.
Another common use of machine translation is to aid human translators. MT systems are routinely used to produce a draft translation that is fixed up in a post-editing phase by a human translator. This task is often called computer-aided translation or CAT. CAT is commonly used as part of localization: the task of adapting content or a product to a particular language community.
Finally, a more recent application of MT is to in-the-moment human commu-
nication needs. This includes incremental translation, translating speech on-the-fly
before the entire sentence is complete, as is commonly used in simultaneous inter-
pretation. Image-centric translation can, for example, run OCR on the text in a phone camera image and feed the result to an MT system to translate menus or street signs.
The standard algorithm for MT is the encoder-decoder network, also called the sequence-to-sequence network, an architecture that can be implemented with RNNs or with Transformers. We've seen in prior chapters that an RNN or Transformer architecture can be used to do classification (for example to map a sentence to a positive or negative sentiment tag for sentiment analysis), or can be used to do sequence labeling (for example to assign each word in an input sentence a part-of-speech or a named entity tag). For part-of-speech tagging, recall that the output tag is associated directly with each input word, and so we can just model the tag as output y_t for each input word x_t.
Figure 10.1 Examples of other word order differences: (a) In German, adverbs occur in
initial position that in English are more natural later, and tensed verbs occur in second posi-
tion. (b) In Mandarin, preposition phrases expressing goals often occur pre-verbally, unlike
in English.
Fig. 10.1 shows examples of other word order differences. All of these word
order differences between languages can cause problems for translation, requiring
the system to do huge structural reorderings as it generates the output.
Figure 10.2 The complex overlap between English leg, foot, etc., and various French translations (patte, jambe, pied, étape) as discussed by Hutchins and Somers (1992).
Figure 10.3 The encoder-decoder architecture. The context is a function of the hidden representations of the input, and may be used by the decoder in a variety of ways.
p(y) = p(y_1) p(y_2|y_1) p(y_3|y_1, y_2) ... p(y_m|y_1, ..., y_{m−1})    (10.7)

h_t = g(h_{t−1}, x_t)    (10.8)

y_t = f(h_t)    (10.9)
We only have to make one slight change to turn this language model with autoregressive generation into a translation model that can translate from a source text in one language to a target text in a second: add a sentence separation marker at the end of the source text, and then simply concatenate the target text. We briefly
introduced this idea of a sentence separator token in Chapter 9 when we considered
using a Transformer language model to do summarization, by training a conditional
language model.
If we call the source text x and the target text y, we are computing the probability
p(y|x) as follows:
p(y|x) = p(y_1|x) p(y_2|y_1, x) p(y_3|y_1, y_2, x) ... p(y_m|y_1, ..., y_{m−1}, x)    (10.10)
Fig. 10.4 shows the setup for a simplified version of the encoder-decoder model
(we’ll see the full model, which requires attention, in the next section).
Fig. 10.4 shows an English source text (“the green witch arrived”), a sentence separator token (<s>), and a Spanish target text (“llegó la bruja verde”). To trans-
late a source text, we run it through the network performing forward inference to
generate hidden states until we get to the end of the source. Then we begin autore-
gressive generation, asking for a word in the context of the hidden layer from the
end of the source input as well as the end-of-sentence marker. Subsequent words
are conditioned on the previous hidden state and the embedding for the last word
generated.
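As a concrete, highly simplified illustration of this inference procedure, here is a short Python sketch (not the book's code). The function lm_step, and the names vocab, sep, and eos, are hypothetical stand-ins for the trained network and its vocabulary: lm_step is assumed to take a token and a hidden state and return a next-word probability distribution and an updated hidden state.

def translate(source_tokens, lm_step, vocab, sep="<s>", eos="</s>", max_len=50):
    # Forward inference over the source text; we only need the final hidden state.
    h = None
    for tok in source_tokens:
        _, h = lm_step(tok, h)
    # Autoregressive generation: condition first on the separator token,
    # then on each previously generated word.
    output, prev = [], sep
    for _ in range(max_len):
        probs, h = lm_step(prev, h)                # probs: dict mapping word -> probability
        prev = max(vocab, key=lambda w: probs[w])  # greedy choice of the most probable word
        if prev == eos:
            break
        output.append(prev)
    return output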
Later we'll see how to use pairs of Transformers as well; it's even possible to use separate architectures for the encoder and decoder.
Figure 10.4 Translating a single sentence (inference time) in the basic RNN version of the encoder-decoder approach to machine translation. Source and target sentences are concatenated with a separator token in between, and the decoder uses context information from the encoder's last hidden state.
Let’s formalize and generalize this model a bit in Fig. 10.5. (To help keep things
straight, we’ll use the superscripts e and d where needed to distinguish the hidden
states of the encoder and the decoder.) The elements of the network on the left
process the input sequence x and comprise the encoder. While our simplified fig-
ure shows only a single network layer for the encoder, stacked architectures are the
norm, where the output states from the top layer of the stack are taken as the fi-
nal representation. A widely used encoder design makes use of stacked biLSTMs
where the hidden states from top layers from the forward and backward passes are
concatenated as described in Chapter 9 to provide the contextualized representations
for each time step.
Decoder
y1 y2 y3 y4 </s>
(output is ignored during encoding)
softmax
embedding
layer
x1 x2 x3 xn <s> y1 y2 y3 yn
Encoder
Figure 10.5 A more formal version of translating a sentence at inference time in the basic RNN-based
encoder-decoder architecture. The final hidden state of the encoder RNN, hen , serves as the context for the
decoder in its role as hd0 in the decoder RNN.
In the simplest setup, the context c is used just to initialize the first hidden state of the decoder; that is, the first decoder RNN cell uses c as its prior hidden state h^d_0. The decoder then autoregressively generates a sequence of outputs, an element at a time, until an end-of-sequence marker is generated. Each hidden state is conditioned on the previous hidden state and the output generated in the previous state.
Figure 10.6 Allowing every hidden state of the decoder (not just the first decoder state) to be influenced by the context c produced by the encoder.
One weakness of this approach as described so far is that the influence of the
context vector, c, will wane as the output sequence is generated. A solution is to
make the context vector c available at each step in the decoding process by adding
it as a parameter to the computation of the current hidden state, using the following equation (illustrated in Fig. 10.6):

h^d_t = g(ŷ_{t−1}, h^d_{t−1}, c)    (10.11)
Now we’re ready to see the full equations for this version of the decoder in the basic
encoder-decoder model, with context available at each decoding timestep. Recall
that g is a stand-in for some flavor of RNN and ŷt−1 is the embedding for the output
sampled from the softmax at the previous step:
c = h^e_n
h^d_0 = c
h^d_t = g(ŷ_{t−1}, h^d_{t−1}, c)
z_t = f(h^d_t)
y_t = softmax(z_t)    (10.12)
Finally, as shown earlier, the output y at each time step consists of a softmax com-
putation over the set of possible outputs (the vocabulary, in the case of language
modeling or MT). We compute the most likely output at each time step by taking the argmax over the softmax output:

ŷ_t = argmax_{w ∈ V} P(w | x, y_1, ..., y_{t−1})    (10.13)
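As a sketch of these equations, the following numpy code implements a single decoder step, assuming a simple (non-gated) RNN for g; real systems would use an LSTM or GRU, and the parameter matrices E, W, U, C, and V here are hypothetical.

import numpy as np

def decoder_step(y_prev_id, h_prev, c, E, W, U, C, V):
    y_hat = E[y_prev_id]                         # embedding of the previous output ŷ_{t-1}
    h = np.tanh(W @ y_hat + U @ h_prev + C @ c)  # h^d_t = g(ŷ_{t-1}, h^d_{t-1}, c)
    z = V @ h                                    # z_t = f(h^d_t)
    probs = np.exp(z - z.max())
    probs /= probs.sum()                         # y_t = softmax(z_t)
    next_id = int(np.argmax(probs))              # ŷ_t: the most likely word
    return next_id, h, probs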
Figure 10.7 Training the basic RNN encoder-decoder approach to machine translation. Note that in the decoder we usually don't propagate the model's softmax outputs ŷ_t, but use teacher forcing to force each input to the correct gold value for training. We compute the softmax output distribution over ŷ in the decoder in order to compute the loss at each token, which can then be averaged to compute a loss for the sentence.
Note the differences between training (Fig. 10.7) and inference (Fig. 10.4) with respect to the outputs at each time step. The decoder during inference uses its own estimated output ŷ_t as the input for the next time step x_{t+1}. Thus the decoder will tend to deviate more and more from the gold target sentence as it keeps generating more tokens. In training, therefore, it is more common to use teacher forcing in the decoder. Teacher forcing means that we force the system to use the gold target token from training as the next input x_{t+1}, rather than allowing it to rely on the (possibly erroneous) decoder output ŷ_t. This speeds up training.
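The following sketch shows what teacher forcing looks like in code, reusing the hypothetical decoder_step function from the sketch above: the input at each step is the gold token, and the loss is the average cross-entropy of the gold tokens under the model's softmax.

import numpy as np

def sentence_loss(gold_ids, h0, c, params):
    h, total = h0, 0.0
    prev = gold_ids[0]                    # e.g. the id of the <s> separator token
    for gold in gold_ids[1:]:
        _, h, probs = decoder_step(prev, h, c, *params)
        total += -np.log(probs[gold])     # cross-entropy loss for this token
        prev = gold                       # teacher forcing: feed the gold token, not ŷ_t
    return total / (len(gold_ids) - 1)    # average the per-token losses over the sentence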
10.4 Attention
The simplicity of the encoder-decoder model is its clean separation of the encoder—
which builds a representation of the source text—from the decoder, which uses this
context to generate a target text. In the model as we’ve described it so far, this
context vector is hn , the hidden state of the last (nth) time step of the source text.
This final hidden state is thus acting as a bottleneck: it must represent absolutely
everything about the meaning of the source text, since the only thing the decoder
knows about the source text is what’s in this context vector (Fig. 10.8). Information
at the beginning of the sentence, especially for long sentences, may not be equally
well represented in the context vector.
The attention mechanism is a solution to the bottleneck problem, a way of allowing the decoder to get information from all the hidden states of the encoder, not just the last hidden state.
In the attention mechanism, as in the vanilla encoder-decoder model, the context vector c is a single vector that is a function of the hidden states of the encoder, that is, c = f(h^e_1 ... h^e_n). Because the number of hidden states varies with the size of the input, we can't use the entire tensor of encoder hidden state vectors directly as the context for the decoder.

Figure 10.8 Requiring the context c to be only the encoder's final hidden state forces all the information from the entire source sentence to pass through this representational bottleneck.
The idea of attention is instead to create the single fixed-length vector c by taking
a weighted sum of all the encoder hidden states. The weights focus on (‘attend
to’) a particular part of the source text that is relevant for the token the decoder is
currently producing. Attention thus replaces the static context vector with one that
is dynamically derived from the encoder hidden states, different for each token in
decoding.
This context vector, ci , is generated anew with each decoding step i and takes
all of the encoder hidden states into account in its derivation. We then make this
context available during decoding by conditioning the computation of the current
decoder hidden state on it (along with the prior hidden state and the previous output
generated by the decoder), as we see in this equation (and Fig. 10.9):

h^d_i = g(ŷ_{i−1}, h^d_{i−1}, c_i)    (10.14)
Figure 10.9 The attention mechanism allows each hidden state of the decoder to see a different, dynamic, context, which is a function of all the encoder hidden states.
The first step in computing c_i is to compute how much to focus on each encoder state, how relevant each encoder state is to the decoder state captured in h^d_{i−1}. We capture relevance by computing—at each state i during decoding—a score(h^d_{i−1}, h^e_j) for each encoder state j.
The simplest such score, called dot-product attention, implements relevance as similarity: measuring how similar the decoder hidden state is to an encoder hidden state, by computing the dot product between them:

score(h^d_{i−1}, h^e_j) = h^d_{i−1} · h^e_j    (10.15)
The score that results from this dot product is a scalar that reflects the degree of
similarity between the two vectors. The vector of these scores across all the encoder
hidden states gives us the relevance of each encoder state to the current step of the
decoder.
To make use of these scores, we'll normalize them with a softmax to create a vector of weights, α_{ij}, that tells us the proportional relevance of each encoder hidden state j to the prior hidden decoder state, h^d_{i−1}.
α_{ij} = softmax(score(h^d_{i−1}, h^e_j) ∀ j ∈ e)
       = exp(score(h^d_{i−1}, h^e_j)) / Σ_k exp(score(h^d_{i−1}, h^e_k))    (10.16)
Finally, given the distribution in α, we can compute a fixed-length context vector for
the current decoder state by taking a weighted average over all the encoder hidden
states.
c_i = Σ_j α_{ij} h^e_j    (10.17)
With this, we finally have a fixed-length context vector that takes into account
information from the entire encoder state that is dynamically updated to reflect the
needs of the decoder at each step of decoding. Fig. 10.10 illustrates an encoder-
decoder network with attention, focusing on the computation of one context vector
ci .
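In code, dot-product attention amounts to a few lines of numpy; this is a minimal sketch of Eqs. 10.15-10.17 for a single decoding step, with the encoder hidden states stacked into a matrix.

import numpy as np

def attention_context(h_dec_prev, H_enc):
    # h_dec_prev: shape (d,); H_enc: shape (n, d), one row per encoder state
    scores = H_enc @ h_dec_prev              # score(h^d_{i-1}, h^e_j) = dot product
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                   # α_{ij}: softmax over the encoder states
    c_i = alphas @ H_enc                     # c_i = Σ_j α_{ij} h^e_j
    return c_i, alphas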
Figure 10.10 A sketch of the encoder-decoder network with attention, focusing on the computation of c_i. The context value c_i is one of the inputs to the computation of h^d_i. It is computed by taking the weighted sum of all the encoder hidden states, each weighted by their dot product with the prior decoder hidden state h^d_{i−1}.
It's also possible to create more sophisticated scoring functions for attention models. Instead of simple dot-product attention, we can get a more powerful function that computes the relevance of each encoder hidden state to the decoder hidden state by parameterizing the score with its own set of weights, W_s:

score(h^d_{i−1}, h^e_j) = h^d_{i−1} W_s h^e_j    (10.18)
The weights Ws , which are then trained during normal end-to-end training, give the
network the ability to learn which aspects of similarity between the decoder and
encoder states are important to the current application. This bilinear model also
allows the encoder and decoder to use different dimensional vectors, whereas the
simple dot-product attention requires that the encoder and decoder hidden states
have the same dimensionality.
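As a sketch, the parameterized score is a one-line change to the attention code above; Ws is a hypothetical learned matrix, and because it mediates between the two spaces, the decoder and encoder states may have different dimensionalities.

def bilinear_score(h_dec_prev, H_enc, Ws):
    # h_dec_prev: (d_dec,); Ws: (d_dec, d_enc); H_enc: (n, d_enc) numpy arrays
    return H_enc @ (Ws.T @ h_dec_prev)       # score_j = h^d_{i-1} · Ws · h^e_j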
10.5 Beam Search
Choosing the single most probable token to generate at each step is called greedy decoding; a greedy algorithm is one that makes a choice that is locally optimal, whether or not it will turn out to have been the best choice with hindsight.
Indeed, greedy search is not optimal, and may not find the highest probability
translation. The problem is that the token that looks good to the decoder now might
turn out later to have been the wrong choice!
Let's see this by looking at the search tree, a graphical representation of the
choices the decoder makes in searching for the best translation, in which we view
the decoding problem as a heuristic state-space search and systematically explore
the space of possible outputs. In such a search tree, the branches are the actions, in
this case the action of generating a token, and the nodes are the states, in this case
the state of having generated a particular prefix. We are searching for the best action
sequence, i.e. the target string with the highest probability. Fig. 10.11 demonstrates
the problem, using a made-up example. Notice that the most probable sequence is
ok ok </s> (with a probability of .4*.7*1.0), but a greedy search algorithm will fail
to find it, because it incorrectly chooses yes as the first word since it has the highest
local probability.
Figure 10.11 A search tree for generating the target string T = t_1, t_2, ... from the vocabulary V = {yes, ok, </s>}, given the source string, showing the probability of generating each token from that state. Greedy search would choose yes at the first time step followed by yes, instead of the globally most probable sequence ok ok.
Recall from Chapter 8 that for part-of-speech tagging we used dynamic pro-
gramming search (the Viterbi algorithm) to address this problem. Unfortunately,
dynamic programming is not applicable to generation problems with long-distance
dependencies between the output decisions. The only method guaranteed to find the
best solution is exhaustive search: computing the probability of every one of the V^T possible sentences (for some length value T), which is obviously too slow.
Instead, decoding in MT and other sequence generation problems generally uses a method called beam search. In beam search, instead of choosing the best token to generate at each timestep, we keep k possible tokens at each step. This fixed-size memory footprint k is called the beam width, on the metaphor of a flashlight beam that can be parameterized to be wider or narrower.
Thus at the first step of decoding, we compute a softmax over the entire vocab-
ulary, assigning a probability to each word. We then select the k-best options from
this softmax output. These initial k outputs are the search frontier and these k initial
words are called hypotheses. A hypothesis is an output sequence, a translation-so-
far, together with its probability.
Figure 10.12 Beam search decoding with a beam width of k = 2. At each time step, we choose the k best hypotheses, compute the V possible extensions of each hypothesis, score the resulting k · V possible hypotheses and choose the best k to continue. At time 1, the frontier is filled with the best 2 options from the initial state of the decoder: arrived and the. We then extend each of those, compute the probability of all the hypotheses so far (arrived the, arrived aardvark, the green, the witch) and compute the best 2 (in this case the green and the witch) to be the search frontier to extend on the next step. On the arcs we show the decoders that we run to score the extension words (although for simplicity we haven't shown the context value c_i that is input at each step).
Thus at each step, to compute the probability of a partial translation, we simply add
the log probability of the prefix translation so far to the log probability of generating
the next token. Fig. 10.13 shows the scoring for the example sentence shown in
Fig. 10.12, using some simple made-up probabilities. Log probabilities are negative
or 0, and the max of two log probabilities is the one that is greater (closer to 0).
Figure 10.13 Scoring for beam search decoding with a beam width of k = 2. We maintain the log probability
of each hypothesis in the beam by incrementally adding the logprob of generating each next token. Only the top
k paths are extended to the next step.
In pseudocode, the beam search decoder begins with the following initialization:

y_0, h_0 ← 0
path ← ()
complete_paths ← ()
state ← (c, y_0, h_0, path)      ; initial state
frontier ← ⟨state⟩               ; initial frontier
Because longer hypotheses accumulate more negative log probability terms, it is usual to apply some form of length normalization to each of the hypotheses, for example simply dividing the negative log probability by the number of words:
score(y) = − log P(y|x) = − (1/T) Σ_{i=1}^{t} log P(y_i | y_1, ..., y_{i−1}, x)    (10.20)
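Putting the pieces together, here is a minimal, simplified Python sketch of beam search with length normalization. The function step is a hypothetical stand-in that maps a prefix of tokens to a list of (token, log probability) pairs for the next position; a real decoder would also thread hidden states and context vectors through the search.

def beam_search(step, k=2, eos="</s>", max_len=30):
    frontier = [([], 0.0)]                    # hypotheses: (prefix, sum of log probs)
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in frontier:
            for tok, lp in step(prefix):      # extend each hypothesis by one token
                candidates.append((prefix + [tok], logp + lp))
        candidates.sort(key=lambda h: h[1], reverse=True)
        frontier = []
        for prefix, logp in candidates[:k]:   # keep only the k best extensions
            (complete if prefix[-1] == eos else frontier).append((prefix, logp))
        if not frontier:
            break
    # length normalization (Eq. 10.20): compare hypotheses by per-token log probability
    return max(complete + frontier, key=lambda h: h[1] / len(h[0]))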
10.6 Encoder-Decoder with Transformers

Figure 10.15 The encoder-decoder architecture using transformer components. The encoder uses the transformer blocks we saw in Chapter 9, while the decoder uses a more powerful block with an extra encoder-decoder attention layer. The final output of the encoder H^enc = h_1, ..., h_T is used to form the K and V inputs to the cross-attention layer in each decoder block.
But the components of the architecture differ somewhat from the RNN and also from the transformer block we've seen. First, in order to attend to the source language, each transformer block in the decoder has an extra cross-attention layer. Recall that the transformer block of Chapter 9 consists of a self-attention layer that attends to the input from the previous layer, followed by layer norm, a feedforward layer, and another layer norm. The decoder transformer block includes an extra layer with a special kind of attention, cross-attention (also sometimes called
encoder-decoder attention or source attention). Cross-attention has the same form
as the multi-headed self-attention in a normal transformer block, except that while
the queries as usual come from the previous layer of the decoder, the keys and values
come from the output of the encoder.
That is, the final output of the encoder Henc = h1 , ..., ht is multiplied by the
cross-attention layer’s key weights WK and value weights WV , but the output from
the prior decoder layer Hdec[i−1] is multiplied by the cross-attention layer’s query
weights WQ :
Q = WQ Hdec[i−1] ; K = WK Henc ; V = WV Henc (10.21)
CrossAttention(Q, K, V) = softmax( QKᵀ / √d_k ) V    (10.22)
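A minimal numpy sketch of a single cross-attention head follows (no multi-head split, no masking); it is written in the row-vector convention (X W) rather than the W X form of Eq. 10.21, and the weight matrices WQ, WK, WV are hypothetical parameters.

import numpy as np

def cross_attention(H_dec_prev, H_enc, WQ, WK, WV):
    # H_dec_prev: (m, d) output of the previous decoder layer; H_enc: (n, d) encoder output
    Q = H_dec_prev @ WQ                        # queries come from the decoder
    K = H_enc @ WK                             # keys come from the encoder
    V = H_enc @ WV                             # values come from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (m, n): relevance of each source position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                         # each target position attends over the source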
Figure 10.16 The transformer block for the encoder and the decoder. Each decoder block has an extra cross-attention layer, which uses the output of the final encoder layer H^enc = h_1, ..., h_t to produce its key and value vectors.
The cross-attention thus allows the decoder to attend to each of the source language words as projected into the entire encoder final output representations. The other
attention layer in each decoder block, the self-attention layer, is the same causal (left-
to-right) self-attention that we saw in Chapter 9. The self-attention in the encoder,
however, is allowed to look ahead at the entire source language text.
In training, just as for RNN encoder-decoders, we use teacher forcing, and train
autoregressively, at each time step predicting the next token in the target language,
using cross-entropy loss.
10.7 Some practical details on building MT systems

10.7.1 Tokenization
Machine translation systems generally use a fixed vocabulary. A common way to generate this vocabulary is with the BPE or wordpiece algorithms sketched in Chapter 2. Generally a shared vocabulary is used for the source and target languages,
which makes it easy to copy tokens (like names) from source to target, so we build
the wordpiece/BPE lexicon on a corpus that contains both source and target lan-
guage data. Wordpieces use a special symbol at the beginning of each token; here’s
a resulting tokenization from the Google MT system (Wu et al., 2016):
words: Jet makers feud over seat width with big orders at stake
wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
We gave the BPE algorithm in detail in Chapter 2; the wordpiece algorithm is similar, except that instead of merging the most frequent pair of symbols, it is given a training corpus and a desired vocabulary size and chooses the merges that most increase the probability of the training corpus under a language model.
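As an illustration of how such a vocabulary is applied at inference time (the training procedure that chooses the vocabulary is a separate matter), here is a hedged sketch of greedy longest-match tokenization over a given wordpiece vocabulary, using an initial "_" as the special symbol that marks word beginnings; the tiny vocabulary is purely illustrative.

def wordpiece_tokenize(word, vocab, marker="_"):
    pieces, text = [], marker + word
    while text:
        for end in range(len(text), 0, -1):    # try the longest matching piece first
            if text[:end] in vocab:
                pieces.append(text[:end])
                text = text[end:]
                break
        else:
            return [marker + "<unk>"]          # no piece in the vocabulary matches
    return pieces

vocab = {"_J", "et", "_fe", "ud"}
print(wordpiece_tokenize("Jet", vocab))        # ['_J', 'et']
print(wordpiece_tokenize("feud", vocab))       # ['_fe', 'ud']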
10.7.2 MT corpora
Machine translation models are trained on a parallel corpus, sometimes called a bitext, a text that appears in two (or more) languages. Large numbers of parallel corpora are available. Some are governmental; the Europarl corpus (Koehn,
2005), extracted from the proceedings of the European Parliament, contains between
400,000 and 2 million sentences each from 21 European languages. The United Nations Parallel Corpus contains on the order of 10 million sentences in the six official languages of the United Nations (Arabic, Chinese, English, French, Russian, Spanish) (Ziemski et al., 2016). Other parallel corpora have been made from movie and TV subtitles, like the OpenSubtitles corpus (Lison and Tiedemann, 2016), or from general web text, like the ParaCrawl corpus of 223 million sentence pairs between 23 EU languages and English extracted from the CommonCrawl (Bañón et al., 2020).
Sentence alignment
Standard training corpora for MT come as aligned pairs of sentences. When creating
new corpora, for example for underresourced languages or new domains, these sen-
tence alignments must be created. Fig. 10.17 gives a sample hypothetical sentence
alignment.
E1: “Good morning,” said the little prince.
E2: “Good morning,” said the merchant.
E3: This was a merchant who sold pills that had been perfected to quench thirst.
E4: You just swallow one pill a week and you won't feel the need for anything to drink.
E5: “They save a huge amount of time,” said the merchant.
E6: “Fifty-three minutes a week.”
E7: “If I had fifty-three minutes to spend?” said the little prince to himself.
E8: “I would take a stroll to a spring of fresh water”

F1: -Bonjour, dit le petit prince.
F2: -Bonjour, dit le marchand de pilules perfectionnées qui apaisent la soif.
F3: On en avale une par semaine et l'on n'éprouve plus le besoin de boire.
F4: -C'est une grosse économie de temps, dit le marchand.
F5: Les experts ont fait des calculs.
F6: On épargne cinquante-trois minutes par semaine.
F7: “Moi, se dit le petit prince, si j'avais cinquante-trois minutes à dépenser, je marcherais tout doucement vers une fontaine...”
Figure 10.17 A sample alignment between sentences in English and French, with sentences extracted from
Antoine de Saint-Exupery’s Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences
e1 , ..., en , and f1 , ..., fn and finds minimal sets of sentences that are translations of each other, including single
sentence mappings like (e1 ,f1 ), (e4 ,f3 ), (e5 ,f4 ), (e6 ,f6 ) as well as 2-1 alignments (e2 /e3 ,f2 ), (e7 /e8 ,f7 ), and null
alignments (f5 ).
Given two documents that are translations of each other, we generally need two
steps to produce sentence alignments:
• a cost function that takes a span of source sentences and a span of target sen-
tences and returns a score measuring how likely these spans are to be transla-
tions.
• an alignment algorithm that takes these scores to find a good alignment be-
tween the documents.
Since it is possible to induce multilingual sentence embeddings (Artetxe and
Schwenk, 2019), cosine similarity of such embeddings provides a natural scoring
function (Schwenk, 2018). Thompson and Koehn (2019) give the following cost
function between two sentences or spans x,y from the source and target documents
respectively:
c(x, y) = (1 − cos(x, y)) nSents(x) nSents(y) / [ Σ_{s=1}^{S} (1 − cos(x, y_s)) + Σ_{s=1}^{S} (1 − cos(x_s, y)) ]    (10.23)
where nSents() gives the number of sentences (this biases the metric toward many
alignments of single sentences instead of aligning very large spans). The denom-
inator helps to normalize the similarities, and so x1 , ..., xS , y1 , ..., yS , are randomly
selected sentences sampled from the respective documents.
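A direct numpy transcription of Eq. 10.23 might look like the following sketch, where x_emb and y_emb are the multilingual embeddings of the two candidate spans, n_x and n_y are the numbers of sentences in those spans, and x_rand and y_rand are embeddings of the randomly sampled sentences used for normalization.

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def alignment_cost(x_emb, y_emb, n_x, n_y, x_rand, y_rand):
    numer = (1 - cos(x_emb, y_emb)) * n_x * n_y
    denom = sum(1 - cos(x_emb, ys) for ys in y_rand) + \
            sum(1 - cos(xs, y_emb) for xs in x_rand)
    return numer / denom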
Usually dynamic programming is used as the alignment algorithm (Gale and
Church, 1993), in a simple extension of the minimum edit distance algorithm we
introduced in Chapter 2.
Finally, it’s helpful to do some corpus cleanup by removing noisy sentence pairs.
This can involve handwritten rules to remove low-precision pairs (for example re-
moving sentences that are too long, too short, have different URLs, or even pairs
that are too similar, suggesting that they were copies rather than translations). Or
pairs can be ranked by their multilingual embedding cosine score and low-scoring
pairs discarded.
10.7.3 Backtranslation
We’re often short of data for training MT models, since parallel corpora may be
limited for particular languages or domains. However, often we can find a large
monolingual corpus, to add to the smaller parallel corpora that are available.
Backtranslation is a way of making use of monolingual corpora in the target
language by creating synthetic bitexts. In backtranslation, we train an intermediate
target-to-source MT system on the small bitext to translate the monolingual target
data to the source language. Now we can add this synthetic bitext (natural target
sentences, aligned with MT-produced source sentences) to our training data, and
retrain our source-to-target MT model. For example suppose we want to translate
from Navajo to English but only have a small Navajo-English bitext, although of
course we can find lots of monolingual English data. We use the small bitext to build
an MT engine going the other way (from English to Navajo). Once we translate the
monolingual English text to Navajo, we can add this synthetic Navajo/English bitext
to our training data.
Backtranslation has various parameters. One is how we generate the backtranslated data; we can run the decoder in greedy inference, or use beam search. Or we can do sampling, or Monte Carlo search. In Monte Carlo decoding, at each timestep, instead of always generating the word with the highest softmax probability, we roll a weighted die, and use it to choose the next word according to its
softmax probability. This works just like the sampling algorithm we saw in Chap-
ter 3 for generating random sentences from n-gram language models. Imagine there
are only 4 words and the softmax probability distribution at time t is (the: 0.6, green:
0.2, a: 0.1, witch: 0.1). We roll a weighted die, with the 4 sides weighted 0.6, 0.2,
0.1, and 0.1, and choose the word based on which side comes up. Another parameter
is the ratio of backtranslated data to natural bitext data; we can choose to upsample
the bitext data (include multiple copies of each sentence).
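The weighted die described above amounts to sampling from the softmax distribution; here is a short numpy sketch using the toy four-word distribution from the text.

import numpy as np

rng = np.random.default_rng(0)
words = ["the", "green", "a", "witch"]
probs = [0.6, 0.2, 0.1, 0.1]
next_word = rng.choice(words, p=probs)   # "the" about 60% of the time, "green" 20%, etc.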
In general backtranslation works surprisingly well; one estimate suggests that a
system trained on backtranslated text gets about 2/3 of the gain as would training on
the same amount of natural bitext (Edunov et al., 2018).
10.8 MT Evaluation
Translations are evaluated along two dimensions:
1. adequacy: how well the translation captures the exact meaning of the source sentence. Sometimes called faithfulness or fidelity.
2. fluency: how fluent the translation is in the target language (is it grammatical, clear, readable, natural).
Using humans to evaluate is most accurate, but automatic metrics are also used for
convenience.
Next let’s see how many unigrams and bigrams match between the reference and
hypothesis:
unigrams that match: w i t n e s s f o t h e p a s t , (17 unigrams)
bigrams that match: wi it tn ne es ss th he ep pa as st t, (13 bigrams)
We use that to compute the unigram and bigram precisions and recalls:
unigram P: 17/17 = 1 unigram R: 17/18 = .944
bigram P: 13/16 = .813 bigram R: 13/17 = .765
Finally we average to get chrP and chrR, and compute the F-score:

chrF_β = (1 + β²) · chrP · chrR / (β² · chrP + chrR)
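Here is a hedged sketch of a chrF-style computation (character n-gram precision and recall averaged over n = 1..N, combined with an F_β); real chrF implementations differ in details such as whitespace and tokenization handling.

from collections import Counter

def ngrams(chars, n):
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(ref, hyp, N=2, beta=2.0):
    ref_c, hyp_c = ref.replace(" ", ""), hyp.replace(" ", "")
    precs, recs = [], []
    for n in range(1, N + 1):
        r, h = ngrams(ref_c, n), ngrams(hyp_c, n)
        overlap = sum((r & h).values())                  # clipped count of matching n-grams
        precs.append(overlap / max(sum(h.values()), 1))  # n-gram precision
        recs.append(overlap / max(sum(r.values()), 1))   # n-gram recall
    chrP, chrR = sum(precs) / N, sum(recs) / N
    if chrP + chrR == 0:
        return 0.0
    return (1 + beta**2) * chrP * chrR / (beta**2 * chrP + chrR)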
Limitations

While automatic character and word-overlap metrics like chrF or BLEU are useful,
they have important limitations. chrF is very local: a large phrase that is moved
around might barely change the chrF score at all, and chrF can’t evaluate cross-
sentence properties of a document like its discourse coherence (Chapter 22). chrF
and similar automatic metrics also do poorly at comparing very different kinds of
systems, such as comparing human-aided translation against machine translation, or
R_BERT = (1/|x|) Σ_{x_i ∈ x} max_{x̃_j ∈ x̃} x_i · x̃_j        P_BERT = (1/|x̃|) Σ_{x̃_j ∈ x̃} max_{x_i ∈ x} x_i · x̃_j    (10.25)
Figure 10.18 The computation of BERTScore recall from reference x and candidate x̂, from Figure 1 in Zhang et al. (2020). This version shows an extended version of the metric in which tokens are also weighted by their idf values.
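As a sketch, the greedy matching in Eq. 10.25 is just a max over a matrix of pairwise similarities; here X and X_tilde are assumed to hold unit-normalized BERT embeddings for the reference and candidate tokens, and the optional idf weighting is omitted.

import numpy as np

def bertscore(X, X_tilde):
    # X: (k, d) reference token embeddings; X_tilde: (m, d) candidate token embeddings
    sim = X @ X_tilde.T                  # pairwise cosine similarities (rows are unit vectors)
    R = sim.max(axis=1).mean()           # each reference token matches its best candidate token
    P = sim.max(axis=0).mean()           # each candidate token matches its best reference token
    F = 2 * P * R / (P + R)              # harmonic mean, as in the BERTScore F measure
    return P, R, F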
10.9 Bias and Ethical Issues
Similarly, a recent challenge set, the WinoMT dataset (Stanovsky et al., 2019)
shows that MT systems perform worse when they are asked to translate sentences
that describe people with non-stereotypical gender roles, like “The doctor asked the
nurse to help her in the operation”.
Many ethical questions in MT require further research. One open problem is
developing metrics for knowing what our systems don’t know. This is because MT
systems can be used in urgent situations where human translators may be unavailable
or delayed: in medical domains, to help translate when patients and doctors don’t
speak the same language, or in legal domains, to help judges or lawyers communi-
cate with witnesses or defendants. In order to ‘do no harm’, systems need ways to
assign confidence values to candidate translations, so they can abstain from giving
incorrect translations that may cause harm.
Another is the need for low-resource algorithms that can translate to and from
all the world’s languages, the vast majority of which do not have large parallel train-
ing texts available. This problem is exacerbated by the tendency of many MT ap-
proaches to focus on the case where one of the languages is English (Anastasopou-
los and Neubig, 2020). ∀ et al. (2020) propose a participatory design process to
encourage content creators, curators, and language technologists who speak these
low-resourced languages to participate in developing MT algorithms. They pro-
vide online groups, mentoring, and infrastructure, and report on a case study on
developing MT algorithms for low-resource African languages.
10.10 Summary
Machine translation is one of the most widely used applications of NLP, and the
encoder-decoder model, first developed for MT, is a key tool that has applications
throughout NLP.
• Languages have divergences, both structural and lexical, that make translation
difficult.
• The linguistic field of typology investigates some of these differences; lan-
guages can be classified by their position along typological dimensions like
whether verbs precede their objects.
• Encoder-decoder networks (either for RNNs or transformers) are composed
of an encoder network that takes an input sequence and creates a contextu-
alized representation of it, the context. This context representation is then
passed to a decoder which generates a task-specific output sequence.
• The attention mechanism in RNNs, and cross-attention in transformers, al-
lows the decoder to view information from all the hidden states of the encoder.
• For the decoder, choosing the single most probable token to generate at each
step is called greedy decoding.
• In beam search, instead of choosing the best token to generate at each timestep,
we keep k possible tokens at each step. This fixed-size memory footprint k is
called the beam width.
• Machine translation models are trained on a parallel corpus, sometimes called
a bitext, a text that appears in two (or more) languages.
• Backtranslation is a way of making use of monolingual corpora in the target
language by running a pilot MT engine backwards to create synthetic bitexts.
• MT is evaluated by measuring a translation’s adequacy (how well it captures
the meaning of the source sentence) and fluency (how fluent or natural it is
in the target language). Human evaluation is the gold standard, but automatic
evaluation metrics like chrF, which measure character n-gram overlap with
human translations, or more recent metrics based on embedding similarity,
are also commonly used.
in the US. As MT research lost academic respectability, the Association for Ma-
chine Translation and Computational Linguistics dropped MT from its name. Some
MT developers, however, persevered, and there were early MT systems like Météo,
which translated weather forecasts from English to French (Chandioux, 1976), and
industrial systems like Systran.
In the early years, the space of MT architectures spanned three general mod-
els. In direct translation, the system proceeds word-by-word through the source-
language text, translating each word incrementally. Direct translation uses a large
bilingual dictionary, each of whose entries is a small program with the job of trans-
lating one word. In transfer approaches, we first parse the input text and then ap-
ply rules to transform the source-language parse into a target language parse. We
then generate the target language sentence from the parse tree. In interlingua ap-
proaches, we analyze the source language text into some abstract meaning repre-
sentation, called an interlingua. We then generate into the target language from
this interlingual representation. A common way to visualize these three early ap-
proaches was the Vauquois triangle shown in Fig. 10.20. The triangle shows the
increasing depth of analysis required (on both the analysis and generation end) as
we move from the direct approach through transfer approaches to interlingual ap-
proaches. In addition, it shows the decreasing amount of transfer knowledge needed
as we move up the triangle, from huge amounts of transfer at the direct level (al-
most all knowledge is transfer knowledge for each word) through transfer (transfer
rules only for parse trees or thematic roles) through interlingua (no specific transfer
knowledge). We can view the encoder-decoder network as an interlingual approach,
with attention acting as an integration of direct and transfer, allowing words or their
representations to be directly accessed by the decoder.
Figure 10.20 The Vauquois triangle: the source text is analyzed upward through semantic/syntactic structure toward an interlingua, transfer maps source-side structure to target-side structure, and the target text is generated downward from the transferred structure.
Statistical methods began to be applied around 1990, enabled first by the devel-
opment of large bilingual corpora like the Hansard corpus of the proceedings of the
Canadian Parliament, which are kept in both French and English, and then by the
growth of the Web. Early on, a number of researchers showed that it was possible
to extract pairs of aligned sentences from bilingual corpora, using words or simple
cues like sentence length (Kay and Röscheisen 1988, Gale and Church 1991, Gale
and Church 1993, Kay and Röscheisen 1993).
At the same time, the IBM group, drawing directly on the noisy channel model
for speech recognition, proposed two related paradigms for statistical MT. These include the generative algorithms that became known as IBM Models 1 through 5, implemented in the Candide system. The algorithms (except for the decoder) were published in full detail—encouraged by the US government who had par-
tially funded the work— which gave them a huge impact on the research community
(Brown et al. 1990, Brown et al. 1993). The group also developed a discriminative
approach, called MaxEnt (for maximum entropy, an alternative formulation of logis-
tic regression), which allowed many features to be combined discriminatively rather
than generatively (Berger et al., 1996), which was further developed by Och and Ney
(2002).
By the turn of the century, most academic research on machine translation used statistical MT. An extended approach, called phrase-based translation, was developed, based on inducing translations for phrase-pairs (Och 1998, Marcu and Wong 2002, Koehn et al. 2003, Och and Ney 2004, Deng and Byrne 2005, inter alia). Once automatic metrics like BLEU were developed (Papineni et al., 2002), the log-linear formulation (Och and Ney, 2004) was used to directly optimize evaluation metrics like BLEU in a method known as Minimum Error Rate Training, or MERT (Och, 2003), also drawing from speech recognition models (Chou et al., 1993). Toolkits like GIZA (Och and Ney, 2003) and Moses (Koehn et al. 2006, Zens and Ney 2007) were widely used.
There were also approaches around the turn of the century that were based on syntactic structure (Chapter 12). Models based on transduction grammars (also called synchronous grammars) assign a parallel syntactic tree structure to a pair of sentences in different languages, with the goal of translating the sentences by applying reordering operations on the trees. From a generative perspective, we can view a transduction grammar as generating pairs of aligned sentences in two languages. Some of the most widely used models included the inversion transduction grammar (Wu, 1996) and synchronous context-free grammars (Chiang, 2005).
Neural networks had been applied at various times to various aspects of machine
translation; for example Schwenk et al. (2006) showed how to use neural language
models to replace n-gram language models in a Spanish-English system based on
IBM Model 4. The modern neural encoder-decoder approach was pioneered by
Kalchbrenner and Blunsom (2013), who used a CNN encoder and an RNN decoder.
Cho et al. (2014) (who coined the name “encoder-decoder”) and Sutskever et al.
(2014) then showed how to use extended RNNs for both encoder and decoder. The
idea that a generative decoder should take as input a soft weighting of the inputs,
the central idea of attention, was first developed by Graves (2013) in the context
of handwriting recognition. Bahdanau et al. (2015) extended the idea, named it
“attention” and applied it to MT. The transformer encoder-decoder was proposed by
Vaswani et al. (2017) (see the History section of Chapter 9).
Beam search has an interesting relationship with human language processing; Meister et al. (2020) show that beam search enforces the cognitive property of uni-
form information density in text. Uniform information density is the hypothe-
sis that human language processors tend to prefer to distribute information equally
across the sentence (Jaeger and Levy, 2007).
Research on evaluation of machine translation began quite early. Miller and
Beebe-Center (1956) proposed a number of methods drawing on work in psycholin-
guistics. These included the use of cloze and Shannon tasks to measure intelligibility
as well as a metric of edit distance from a human translation, the intuition that un-
derlies all modern overlap-based automatic evaluation metrics. The ALPAC report
included an early evaluation study conducted by John Carroll that was extremely in-
fluential (Pierce et al., 1966, Appendix 10). Carroll proposed distinct measures for
fidelity and intelligibility, and had raters score them subjectively on 9-point scales.
Much early evaluation work focuses on automatic word-overlap metrics like BLEU
(Papineni et al., 2002), NIST (Doddington, 2002), TER (Translation Error Rate)
(Snover et al., 2006), Precision and Recall (Turian et al., 2003), and METEOR
(Banerjee and Lavie, 2005); character n-gram overlap methods like chrF (Popović,
2015) came later. More recent evaluation work, echoing the ALPAC report, has
emphasized the importance of careful statistical methodology and the use of human
evaluation (Kocmi et al., 2021; Marie et al., 2021).
The early history of MT is surveyed in Hutchins 1986 and 1997; Nirenburg et al.
(2002) collects early readings. See Croft (1990) or Comrie (1989) for introductions
to linguistic typology.
Exercises
10.1 Compute by hand the chrF2,2 score for HYP2 on page 22 (the answer should
round to .62).
Anastasopoulos, A. and G. Neubig. 2020. Should all cross-lingual embeddings speak English? ACL.
Artetxe, M. and H. Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.
Bahdanau, D., K. H. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR 2015.
Banerjee, S. and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
Bañón, M., P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. ACL.
Bar-Hillel, Y. 1960. The present status of automatic translation of languages. In F. Alt, editor, Advances in Computers 1, pages 91–163. Academic Press.
Berger, A., S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
Bickel, B. 2003. Referential density in discourse and syntactic typology. Language, 79(2):708–736.
Brown, P. F., J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85.
Brown, P. F., S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Callison-Burch, C., M. Osborne, and P. Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. EACL.
Chandioux, J. 1976. MÉTÉO: un système opérationnel pour la traduction automatique des bulletins météorologiques destinés au grand public. Meta, 21:127–133.
Chiang, D. 2005. A hierarchical phrase-based model for statistical machine translation. ACL.
Cho, K., B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. EMNLP.
Chou, W., C.-H. Lee, and B. H. Juang. 1993. Minimum error rate training based on n-best string models. ICASSP.
Comrie, B. 1989. Language Universals and Linguistic Typology, 2nd edition. Blackwell.
Croft, W. 1990. Typology and Universals. Cambridge University Press.
Deng, Y. and W. Byrne. 2005. HMM word and phrase alignment for statistical machine translation. HLT-EMNLP.
Doddington, G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. HLT.
Dostert, L. 1955. The Georgetown-I.B.M. experiment. In Machine Translation of Languages: Fourteen Essays, pages 124–135. MIT Press.
Dryer, M. S. and M. Haspelmath, editors. 2013. The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. Available online at [Link]
Edunov, S., M. Ott, M. Auli, and D. Grangier. 2018. Understanding back-translation at scale. EMNLP.
∀, W. Nekoto, V. Marivate, T. Matsila, T. Fasubaa, T. Kolawole, T. Fagbohungbe, S. O. Akinola, S. H. Muhammad, S. Kabongo, S. Osei, S. Freshia, R. A. Niyongabo, R. M. P. Ogayo, O. Ahia, M. Meressa, M. Adeyemi, M. Mokgesi-Selinga, L. Okegbemi, L. J. Martinus, K. Tajudeen, K. Degila, K. Ogueji, K. Siminyu, J. Kreutzer, J. Webster, J. T. Ali, J. A. I. Orife, I. Ezeani, I. A. Dangana, H. Kamper, H. Elsahar, G. Duru, G. Kioko, E. Murhabazi, E. van Biljon, D. Whitenack, C. Onyefuluchi, C. Emezue, B. Dossou, B. Sibanda, B. I. Bassey, A. Olabiyi, A. Ramkilowan, A. Öktem, A. Akinfaderin, and A. Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. Findings of EMNLP.
Gale, W. A. and K. W. Church. 1991. A program for aligning sentences in bilingual corpora. ACL.
Gale, W. A. and K. W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19:75–102.
Graves, A. 2013. Generating sequences with recurrent neural networks. ArXiv.
Hutchins, W. J. 1986. Machine Translation: Past, Present, Future. Ellis Horwood, Chichester, England.
Hutchins, W. J. 1997. From first conception to first demonstration: The nascent years of machine translation, 1947–1954. A chronology. Machine Translation, 12:192–252.
Hutchins, W. J. and H. L. Somers. 1992. An Introduction to Machine Translation. Academic Press.
Jaeger, T. F. and R. P. Levy. 2007. Speakers optimize information density through syntactic reduction. NeurIPS.
Kalchbrenner, N. and P. Blunsom. 2013. Recurrent continuous translation models. EMNLP.
Kay, M. and M. Röscheisen. 1988. Text-translation alignment. Technical Report P90-00143, Xerox Palo Alto Research Center, Palo Alto, CA.
Kay, M. and M. Röscheisen. 1993. Text-translation alignment. Computational Linguistics, 19:121–142.
Kocmi, T., C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, and A. Menezes. 2021. To ship or not to ship: An extensive evaluation of automatic metrics for machine translation. ArXiv.
Koehn, P. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit, vol. 5.
Koehn, P., H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2006. Moses: Open source toolkit for statistical machine translation. ACL.
Koehn, P., F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. HLT-NAACL.
Lison, P. and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. LREC.
Marcu, D. and W. Wong. 2002. A phrase-based, joint probability model for statistical machine translation. EMNLP.
Marie, B., A. Fujita, and R. Rubino. 2021. Scientific credibility of machine translation research: A meta-evaluation of 769 papers. ACL 2021.
McLuhan, M. 1964. Understanding Media: The Extensions of Man. New American Library.
Meister, C., T. Vieira, and R. Cotterell. 2020. If beam search is the answer, what was the question? EMNLP.
Miller, G. A. and J. G. Beebe-Center. 1956. Some psychological methods for evaluating the quality of translations. Mechanical Translation, 3:73–80.
Nirenburg, S., H. L. Somers, and Y. Wilks, editors. 2002. Readings in Machine Translation. MIT Press.
Och, F. J. 1998. Ein beispielsbasierter und statistischer Ansatz zum maschinellen Lernen von natürlichsprachlicher Übersetzung. Ph.D. thesis, Universität Erlangen-Nürnberg, Germany. Diplomarbeit (diploma thesis).
Och, F. J. 2003. Minimum error rate training in statistical machine translation. ACL.
Och, F. J. and H. Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. ACL.
Och, F. J. and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Och, F. J. and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.
Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. ACL.
Pierce, J. R., J. B. Carroll, E. P. Hamp, D. G. Hays, C. F. Hockett, A. G. Oettinger, and A. J. Perlis. 1966. Language and Machines: Computers in Translation and Linguistics. ALPAC report. National Academy of Sciences, National Research Council, Washington, DC.
Popović, M. 2015. chrF: character n-gram F-score for automatic MT evaluation. Proceedings of the Tenth Workshop on Statistical Machine Translation.
Prates, M. O. R., P. H. Avelar, and L. C. Lamb. 2019. Assessing gender bias in machine translation: a case study with Google Translate. Neural Computing and Applications, 32:6363–6381.
Rei, R., C. Stewart, A. C. Farinha, and A. Lavie. 2020. COMET: A neural framework for MT evaluation. EMNLP.
Schiebinger, L. 2014. Scientific research must take gender into account. Nature, 507(7490):9.
Schwenk, H. 2018. Filtering and mining parallel data in a joint multilingual space. ACL.
Schwenk, H., D. Dechelotte, and J.-L. Gauvain. 2006. Continuous space language models for statistical machine translation. COLING/ACL.
Sellam, T., D. Das, and A. Parikh. 2020. BLEURT: Learning robust metrics for text generation. ACL.
Slobin, D. I. 1996. Two ways to travel. In M. Shibatani and S. A. Thompson, editors, Grammatical Constructions: Their Form and Meaning, pages 195–220. Clarendon Press.
Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. AMTA-2006.
Stanovsky, G., N. A. Smith, and L. Zettlemoyer. 2019. Evaluating gender bias in machine translation. ACL.
Sutskever, I., O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. NeurIPS.
Talmy, L. 1985. Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen, editor, Language Typology and Syntactic Description, Volume 3. Cambridge University Press. Originally appeared as UC Berkeley Cognitive Science Program Report No. 30, 1980.
Talmy, L. 1991. Path to realization: A typology of event conflation. BLS-91.
Thompson, B. and P. Koehn. 2019. Vecalign: Improved sentence alignment in linear time and space. EMNLP.
Turian, J. P., L. Shen, and I. D. Melamed. 2003. Evaluation of machine translation and its evaluation. Proceedings of MT Summit IX.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. NeurIPS.
Vauquois, B. 1968. A survey of formal grammars and algorithms for recognition and transformation in machine translation. IFIP Congress 1968.
Weaver, W. 1949/1955. Translation. In W. N. Locke and A. D. Boothe, editors, Machine Translation of Languages, pages 15–23. MIT Press. Reprinted from a memorandum written by Weaver in 1949.
Wu, D. 1996. A polynomial-time algorithm for statistical machine translation. ACL.
Wu, Y., M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. ArXiv preprint arXiv:1609.08144.
Zens, R. and H. Ney. 2007. Efficient phrase-table representation for machine translation with applications to online MT and speech translation. NAACL-HLT.
Zhang, T., V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2020. BERTScore: Evaluating text generation with BERT. ICLR 2020.
Ziemski, M., M. Junczys-Dowmunt, and B. Pouliquen. 2016. The United Nations parallel corpus v1.0. LREC.