CS 4650/7650
Machine Translation 1
Jacob Eisenstein
November 30, 2015
(with slides from David Chiang, Chris Dyer, and Philipp Koehn)
Overview of machine translation
Why is MT hard?
Word order
• English: subject-verb-object
• Japanese: subject-object-verb
• Examples
  – English: IBM bought Lotus
    Japanese: IBM Lotus bought
• French: subject-verb-object... except for pronouns
  – English: I will buy it
    French: Je vais l'acheter (I will it buy)
  – English: I bought it
    French: Je l'ai acheté (I it have bought)
• How many orderings are there?
Word sense ambiguity
• We've already talked about how bill translates as pico (bird) or cuenta (cost).
• Legs and feet in English and French:
• Lexical gaps are when a language lacks a word for a concept.
  – My favorite English lexical gap: schadenfreude, a German word for "pleasure taken from the misfortune of others."
Pronouns
Pronoun morphology can convey different information in each language:
• English possessive pronouns take the gender of the owner:
  Marie rides her bike
• French possessive pronouns take the gender of the object:
  Marie monte sur son vélo
• Translating into English requires:
  – Anaphora resolution: son to Marie.
  – Gender determination: Marie is female.
• Google Translate's treatment of this one is interesting.
  Compare Marie monte son vélo versus Marie vend son vélo.
Pronouns
Many languages (Spanish, Japanese, Chinese, Turkish, Arabic, etc.) can drop pronouns.
• In Spanish, you can recover the pronoun from verb inflection:
  Vivimos en Atlanta → We live in Atlanta
• Again, discourse context is often crucial:
  Vive en Atlanta → She/he/it lives in Atlanta
Chinese example:
Tense and case
• Spanish has two past tenses.
  – The preterite tense is for events with a definite time, e.g.,
    I biked to work this morning
  – The imperfect is for events with indefinite times, e.g.,
    I biked to work all last summer
• To translate English to Spanish, we must pick the right tense.
  This seems to require some deeper semantic understanding.
• Many languages have richer case morphology than English;
  translating to these languages requires identifying whether the
  word should be in the accusative, dative, etc.
Idioms
• Why in the world
• Kick the bucket
• Lend me your ears
• ...
The Vauquois Triangle
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
The noisy channel model
• Remember the noisy channel model?
• This is a general framework for thinking about translation (and many other NLP problems).
(next few slides from Chris Dyer)
One naturally wonders if the problem
of translation could conceivably be
treated as a problem in cryptography.
When I look at an article in Russian, I
say: ‘This is really written in
English, but it has been coded in
some strange symbols. I will now
proceed to decode.’
Warren Weaver to Norbert Wiener, March, 1947
[Figure: the noisy channel. A sent transmission Y passes through a "noisy" channel to give the received transmission X; a decoder recovers the message Ŷ. The source is modeled by p(y), the channel by p(x|y).]

ŷ = argmax_y p(y|x)
  = argmax_y p(x|y) p(y) / p(x)
  = argmax_y p(x|y) p(y)      (the denominator doesn't depend on y)

For translation, English is the sent transmission and "French" is the received transmission:

ê = argmax_e p(f|e) p(e)

where p(f|e) is the translation model and p(e) is the language model.

Other noisy channel applications: OCR, speech recognition, spelling correction...
Division of labor
• Translation model
  – probability of translation back into the source
  – ensures adequacy of translation
• Language model
  – is a translation hypothesis "good" English?
  – ensures fluency of translation
[Figure: the complete pipeline. A learner estimates the language model p(e) from English text; another learner estimates the translation model p(f|e) from English-français parallel text. Given French input (S'il vous plaît traduire...), the decoder computes ê = argmax_e p(e|f) and outputs English (Please translate...).]
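To make the division of labor concrete, here is a minimal sketch (not from the slides) of noisy-channel scoring: given a few candidate translations with hypothetical translation-model and language-model log-probabilities, the decoder just picks the candidate maximizing log p(f|e) + log p(e). The candidate list and scores are illustrative placeholders.

def noisy_channel_decode(candidates, tm_logprob, lm_logprob):
    """Pick the English candidate e maximizing log p(f|e) + log p(e)."""
    return max(candidates, key=lambda e: tm_logprob(e) + lm_logprob(e))

# Hypothetical scores for three candidate translations of one French sentence.
tm = {"please translate": -2.1, "please to translate": -1.8, "translate please": -2.0}
lm = {"please translate": -4.0, "please to translate": -9.5, "translate please": -6.3}

best = noisy_channel_decode(tm.keys(), tm.get, lm.get)
print(best)  # "please translate": adequate (good TM score) and fluent (good LM score)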
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Estimating the components
• We've already learned how to estimate language models.
  – Google uses n-grams of order 5-7.
  – Smoothing is really important; so is being clever about decoding.
• Estimating the translation model is more difficult.
  – e = And the program was implemented
  – f = La programmation a été mise en application
  – P(f|e) is really hard to define.
  – Easier: something like P(la|the) × P(programmation|program) × ...
  – But we are given aligned sentences, not aligned words.
Alignment
• In a parallel text (or when we translate), we align words in one language with the words in the other:

  1    2     3   4
  das  Haus  ist klein
  the  house is  small
  1    2     3   4

• Word positions are numbered 1-4
Alignment function
• Formalizing alignment with an alignment function
• Mapping an English target word at position i to a German source word at position j with a function a : i → j
• Example: a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
Reordering
• Words may be reordered during translation

  1     2    3   4
  klein ist  das Haus
  the   house is small
  1     2     3  4

  a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}
One-to-many translation
• A source word may translate into multiple target words

  1   2    3   4
  das Haus ist klitzeklein
  the house is very small
  1   2     3  4    5

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}
Dropping words
• Words may be dropped when translated
  – The German article das is dropped

  1   2    3   4
  das Haus ist klein
      house is small
      1     2  3

  a : {1 → 2, 2 → 3, 3 → 4}
Inserting words
• Words may be added during translation
  – The English just does not have an equivalent in German
  – We still need to map it to something: a special NULL token

  0    1   2    3   4
  NULL das Haus ist klein
  the  house is just small
  1    2     3  4    5

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}
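These alignment functions are easy to represent directly. A small sketch (illustrative, not from the slides): the last example as a Python dict mapping target positions to source positions, with 0 standing for NULL, printing the induced word pairs.

source = ["NULL", "das", "Haus", "ist", "klein"]   # position 0 is the NULL token
target = ["the", "house", "is", "just", "small"]

# a : target position -> source position (1-indexed targets, 0 = NULL source)
a = {1: 1, 2: 2, 3: 3, 4: 0, 5: 4}

for i, j in sorted(a.items()):
    print(f"{target[i - 1]} <- {source[j]}")
# the <- das, house <- Haus, is <- ist, just <- NULL, small <- klein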
Translation with alignments
Let's add the alignments as a variable into the probability model:

  P(f_1 ... f_m | e_1 ... e_ℓ) = Σ_a P(f_1 ... f_m, a_1 ... a_m | e_1 ... e_ℓ)
                               = ...
                               = Π_i Σ_{a_i} q(a_i | i, ℓ, m) t(f_i | e_{a_i})

• t(f_i | e_{a_i}) is the translation probability
• q(a_i | i, ℓ, m) is the alignment probability

Independence assumptions:
• Word translations are independent given the alignments.
• Alignments are independent of each other.
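As a sanity check on the factorization, a minimal sketch (not from the slides) that computes P(f|e) by marginalizing the alignment of each French word independently; t is a toy lookup table, and q defaults to the uniform 1/ℓ choice of IBM Model 1.

def sentence_prob(f_words, e_words, t, q=None):
    """P(f|e) = prod_i sum_j q(j|i,l,m) * t(f_i|e_j), alignments marginalized per word."""
    ell = len(e_words)
    prob = 1.0
    for i, f in enumerate(f_words, start=1):
        total = 0.0
        for j, e in enumerate(e_words, start=1):
            align = q(j, i, ell, len(f_words)) if q else 1.0 / ell  # Model 1: uniform
            total += align * t.get((f, e), 0.0)
        prob *= total
    return prob

# Toy translation table t(f|e) for a two-word example.
t = {("la", "the"): 0.4, ("maison", "the"): 0.1,
     ("la", "house"): 0.1, ("maison", "house"): 0.6}
print(sentence_prob(["la", "maison"], ["the", "house"], t))  # 0.25 * 0.35 = 0.0875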
Example
And the program has been implemented
La programmation a été mise en application

P(f, a|e) = q(2|1, 6, 7) × t(La|the)
          × q(3|2, 6, 7) × t(programmation|program)
          × q(4|3, 6, 7) × t(a|has)
          × q(5|4, 6, 7) × t(été|been)
          × q(6|5, 6, 7) × t(mise|implemented)
          × q(6|6, 6, 7) × t(en|implemented)
          × q(6|7, 6, 7) × t(application|implemented)
IBM alignment models
• IBM Model 1: q(j|i, ℓ, m) = 1/ℓ
  (All alignments are equally likely.)
Estimation and alignment
• To translate, we need t(f|e), the translation probabilities.
• If we knew the alignments, estimation would be easy:

  t(programme|program) = count-of-programme-aligned-to-program / count-of-program
                       = Σ_{⟨f,e⟩ ∈ examples} Σ_i δ(f_i = programme, e_{a_i} = program)
                         / Σ_{⟨f,e⟩ ∈ examples} Σ_j δ(e_j = program)

• Conversely, getting the alignments would be easy if we knew the translation probabilities:

  P(a|e, f) = P(f|e, a) P(a) / Σ_{a'} P(f|e, a') P(a')

• How to solve a chicken-and-egg problem? EM!
Example
• Translation probabilities

             the    house
  la         0.4    0.1
  maison     0.1    0.6

• Alignments

                                     P(f, a|e)            P(a|f, e)
  the → la, house → la               0.4 × 0.1 = 0.04      0.11
  the → la, house → maison           0.4 × 0.6 = 0.24      0.69
  the → maison, house → la           0.1 × 0.1 = 0.01      0.03
  the → maison, house → maison       0.1 × 0.6 = 0.06      0.17

• Counts

             the                      house
  la         0.11 + 0.69 = 0.80       0.11 + 0.03 = 0.14
  maison     0.03 + 0.17 = 0.20       0.69 + 0.17 = 0.86

Then we add up the counts across all the examples and update the translation probabilities (a small sketch of this E-step follows below).
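A minimal sketch (illustrative, not from the slides) of this E-step on the single sentence pair above: enumerate the four alignments, compute their posteriors, and accumulate the expected counts.

from itertools import product

t = {("la", "the"): 0.4, ("la", "house"): 0.1,
     ("maison", "the"): 0.1, ("maison", "house"): 0.6}

e_words = ["the", "house"]
f_words = ["la", "maison"]

def joint_prob(a):
    """P(f, a|e): each English word independently picks its French translation."""
    p = 1.0
    for e, f in zip(e_words, a):
        p *= t[(f, e)]
    return p

alignments = list(product(f_words, repeat=len(e_words)))
joint = {a: joint_prob(a) for a in alignments}
z = sum(joint.values())

counts = {}
for a, p in joint.items():
    posterior = p / z                      # P(a|f, e)
    for e, f in zip(e_words, a):
        counts[(f, e)] = counts.get((f, e), 0.0) + posterior

print(counts)  # expected counts, e.g. counts[("la", "the")] is about 0.80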
EM algorithm

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

• Initial step: all alignments equally likely
• The model learns that, e.g., la is often aligned with the
• After one iteration: alignments, e.g., between la and the are more likely
• After another iteration: it becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)
• Convergence: the inherent hidden structure is revealed by EM
• Parameter estimation from the aligned corpus:

  p(la|the) = 0.453
  p(le|the) = 0.334
  p(maison|house) = 0.876
  p(bleu|blue) = 0.563
  ...
IBM Model 1 and EM
• The EM algorithm consists of two steps
• Expectation step: apply the model to the data
  – parts of the model are hidden (here: alignments)
  – using the model, assign probabilities to possible values
• Maximization step: estimate the model from the data
  – take the assigned values as fact
  – collect counts (weighted by probabilities)
  – estimate the model from the counts
• Iterate these steps until convergence
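Putting the two steps together, here is a compact sketch of EM for IBM Model 1 translation probabilities (uniform alignment prior, no NULL word); it is illustrative rather than a faithful reimplementation of GIZA++.

from collections import defaultdict

def model1_em(bitext, iterations=10):
    """bitext: list of (french_words, english_words) pairs."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialize t(f|e) uniformly
    for _ in range(iterations):
        count = defaultdict(float)   # expected count(f, e)
        total = defaultdict(float)   # expected count(e)
        for fs, es in bitext:
            for f in fs:
                # E-step: posterior over which English word generated f
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate t(f|e) from the expected counts
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

bitext = [("la maison".split(), "the house".split()),
          ("la maison bleu".split(), "the blue house".split()),
          ("la fleur".split(), "the flower".split())]
t = model1_em(bitext)
print(round(t[("la", "the")], 3), round(t[("fleur", "flower")], 3))
# e.g. t(la|the) and t(fleur|flower) end up with most of the probability mass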
IBM Model 2
Add a prior probability on alignments, q(a_i | i, m, ℓ).
• We estimate q along with t during the M-step.
• The joint probability still decomposes across words:
  P(e, a|f) = Π_i t(e_i | f_{a_i}) q(a_i | i, ℓ_e, ℓ_f)
• But adding this parameter makes the likelihood non-convex.
  – This means initialization affects the outcome.
  – Initializing from IBM Model 1 works well in practice.
IBM Model 2

  natürlich ist das haus klein
    ↓ lexical translation step
  of course is the house small
    ↓ alignment step
  of course the house is small
IBM Model 3

  Mary did not slap the green witch
    ↓ n(3|slap)
  Mary not slap slap slap the green witch
    ↓ p-null
  Mary not slap slap slap NULL the green witch
    ↓ t(la|the)
  Maria no daba una bofetada a la verde bruja
    ↓ d(4|4)
  Maria no daba una bofetada a la bruja verde
IBM alignment models
• IBM Model 1: q(j|i, ℓ, m) = 1/ℓ
  (All alignments are equally likely.)
• IBM Model 2: q(j|i, ℓ, m) = c(j|i, ℓ, m) / c(i, ℓ, m)
  (Alignment probability is a parameter of the model.)
• IBM Model 3
  (model "fertility", the number of alignments per word)
• IBM Models 4 and 5
  – Condition on the alignment of the preceding word
  – Like an HMM: P(a_i | a_{i−1}, ℓ, m)
There are lots of papers on alignment alone, but Fraser and Marcu (2007) note that improvements in alignment accuracy may not improve overall translation.
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Problems with word-based translation
Word-based translation has obvious limitations:
• Multi-word alignments aren't modeled well:
  P(daba una bofetada|slap) =? P(daba|slap) P(una|slap) P(bofetada|slap)
• Many phrasal translations are non-compositional:
  faire (make) le (the) ménage (home) → clean up
• Alignment decisions for phrasal units should be made jointly:
  la comida me gusta mucho → i like the food a lot

Two solutions: phrases and syntax
Phrase-Based Model
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
Phrase Translation Table
• Main knowledge source: table with phrase translations and their probabilities
• Example: phrase translations for natürlich

  Translation        Probability φ(ē|f̄)
  of course          0.5
  naturally          0.3
  of course ,        0.15
  , of course ,      0.05
Linguistic Phrases?
• Model is not limited to linguistic phrases
  (noun phrases, verb phrases, prepositional phrases, ...)
• Example non-linguistic phrase pair:
  spass am → fun with the
• Prior noun often helps with translation of the preposition
• Experiments show that limitation to linguistic phrases hurts quality
Learning a Phrase Translation Table
• Task: learn the model from a parallel corpus
• Three stages:
  – word alignment: using IBM models or other method
  – extraction of phrase pairs
  – scoring phrase pairs
Phrase-based translation
Typically, we start with a symmetrized set of word alignments:
• Align e to f
• Align f to e (not generally the same!)
• Take the intersection
Phrase Extraction Criteria

[Figure: three alignment grids for "Maria no daba" / "Mary did not slap", labeled consistent, inconsistent, inconsistent]

• The phrase alignment has to contain all alignment points for all covered words
• The phrase alignment has to contain at least one alignment point
Word alignment induced phrases

[Figure: word alignment grid for "Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch"]

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the),
(daba una bofetada a la bruja verde, slap the green witch),
(no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)
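A minimal sketch of consistent phrase-pair extraction (illustrative; real toolkits such as Moses also handle unaligned words and limit phrase length). It takes a sentence pair and a set of word-alignment points and enumerates all phrase pairs whose alignment points stay inside the box.

def extract_phrases(src, tgt, alignment):
    """alignment: set of (source_index, target_index) pairs, 0-indexed."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            # target positions linked to the source span [s1, s2]
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue            # must contain at least one alignment point
            t1, t2 = min(linked), max(linked)
            # consistency: no word inside the target span may align outside the source span
            if any(t1 <= j <= t2 and not s1 <= i <= s2 for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "Maria no daba una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
for f, e in extract_phrases(src, tgt, alignment):
    print(f, "|||", e)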
Scoring Phrase Translations
• Phrase pair extraction: collect all phrase pairs from the data
• Phrase pair scoring: assign probabilities to phrase translations
• Score by relative frequency:

  φ(f̄|ē) = count(ē, f̄) / Σ_{f̄_i} count(ē, f̄_i)
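The relative-frequency scoring itself is only a couple of lines; a small illustrative sketch that counts extracted phrase pairs over a corpus and normalizes per English phrase:

from collections import Counter, defaultdict

def score_phrase_table(pairs):
    """pairs: list of (foreign_phrase, english_phrase) tuples extracted from the corpus."""
    pair_count = Counter(pairs)
    e_count = Counter(e for _, e in pairs)
    table = defaultdict(dict)          # table[e][f] = phi(f|e)
    for (f, e), c in pair_count.items():
        table[e][f] = c / e_count[e]
    return table

pairs = [("natürlich", "of course"), ("natürlich ,", "of course"), ("natürlich", "naturally")]
print(score_phrase_table(pairs)["of course"])   # {'natürlich': 0.5, 'natürlich ,': 0.5}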
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Decoding Process

Maria no dio una bofetada a la bruja verde

• Build the translation left to right:
  – select foreign words to be translated
  – find an English phrase translation
  – add the English phrase to the end of the partial translation
  – mark the foreign words as translated
• The hypothesis grows one phrase at a time:
  Mary → Mary did not (one-to-many translation) → Mary did not slap (many-to-one translation)
  → Mary did not slap the → Mary did not slap the green (reordering)
  → Mary did not slap the green witch (translation finished)
Translation Options

[Table: phrase translation options for spans of the source sentence]
Maria no dio una bofetada a la bruja verde
Mary not give a slap to the witch green
did not a slap by green witch
no slap to the
did not give to
the
slap the witch

• Look up possible phrase translations
  – many different ways to segment words into phrases
  – many different ways to translate each phrase
Hypothesis Expansion
• Start with the empty hypothesis
  – e: no English words
  – f: no foreign words covered (---------)
  – p: probability 1
• Pick a translation option and create a new hypothesis
  – e: add the English phrase Mary
  – f: first foreign word covered (*--------)
  – p: probability 0.534
A Quick Word on Probabilities
• Not going into detail here, but...
• Translation Model
  – phrase translation probability p(Mary|Maria)
  – reordering costs
  – phrase/word count costs
  – ...
• Language Model
  – uses trigrams:
    p(Mary did not) = p(Mary|START) × p(did|Mary, START) × p(not|Mary did)
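For concreteness, a tiny sketch of that trigram computation in log space (the probabilities are hypothetical placeholders):

import math

# Hypothetical trigram log-probabilities (base e).
logprob = {("<s>", "<s>", "Mary"): -2.3,
           ("<s>", "Mary", "did"): -1.6,
           ("Mary", "did", "not"): -0.9}

def trigram_logprob(words):
    """log p(words) under a trigram model with start padding."""
    history = ["<s>", "<s>"]
    total = 0.0
    for w in words:
        total += logprob[(history[0], history[1], w)]
        history = [history[1], w]
    return total

print(math.exp(trigram_logprob(["Mary", "did", "not"])))  # p(Mary did not)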
Hypothesis Expansion
• Add another hypothesis
  – e: witch, f: -------*-, p: .182
• Further hypothesis expansion
  – e: ... slap, f: *-***----, p: .043
• ... until all foreign words are covered, e.g.:
  e: (empty) → Mary → did not → slap → the → green witch
  f: --------- → *-------- → **------- → *****---- → *******-- → *********
  p: 1 → .534 → .154 → .015 → .004283 → .000271
  – find the best hypothesis that covers all foreign words
  – backtrack to read off the translation
• Adding more hypotheses ⇒ explosion of search space
Explosion of Search Space
• The number of hypotheses is exponential with respect to sentence length
⇒ Decoding is NP-complete [Knight, 1999]
⇒ Need to reduce the search space
  – risk free: hypothesis recombination
  – risky: histogram/threshold pruning
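To make the search concrete, here is a heavily simplified stack-decoding sketch (illustrative only; real decoders add distortion limits, future-cost estimates, hypothesis recombination, and proper language-model context). Hypotheses are grouped into stacks by the number of covered source words, each stack is pruned to a fixed beam size, and the phrase table and bigram language model below are toy placeholders.

import heapq, math

def beam_decode(src, phrase_table, lm_bigrams, beam_size=10):
    """Simplified stack decoding over hypotheses (score, coverage, last word, output)."""
    n = len(src)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), "<s>", ""))
    for covered in range(n):
        # histogram pruning: only expand the best hypotheses in this stack
        for score, cov, last, out in heapq.nlargest(beam_size, stacks[covered]):
            for i in range(n):
                for j in range(i + 1, n + 1):
                    if any(k in cov for k in range(i, j)):
                        continue  # span overlaps already-translated words
                    for english, tm_logp in phrase_table.get(" ".join(src[i:j]), []):
                        lm_logp, new_last = 0.0, last
                        for w in english.split():
                            lm_logp += math.log(lm_bigrams.get((new_last, w), 0.05))
                            new_last = w
                        hyp = (score + tm_logp + lm_logp, cov | set(range(i, j)),
                               new_last, (out + " " + english).strip())
                        stacks[len(hyp[1])].append(hyp)
    return max(stacks[n])[3] if stacks[n] else None

phrase_table = {"Maria": [("Mary", -0.1)],
                "no": [("not", -0.7), ("did not", -0.4)],
                "dio una bofetada": [("slap", -0.5)],
                "a la": [("the", -0.3)],
                "bruja verde": [("green witch", -0.2)]}
lm_bigrams = {("<s>", "Mary"): 0.5, ("Mary", "did"): 0.5, ("did", "not"): 0.5,
              ("not", "slap"): 0.5, ("slap", "the"): 0.5, ("the", "green"): 0.5,
              ("green", "witch"): 0.5}
src = "Maria no dio una bofetada a la bruja verde".split()
print(beam_decode(src, phrase_table, lm_bigrams))  # Mary did not slap the green witch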
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Freedom from the rusty manacles of probability
• In the noisy channel model, decoding maximizes:
  log P(f|e) + log P(e)
• We might decide these components are not equally important:
  λ1 log P(f|e) + λ2 log P(e)
• But this is just a log-linear model. Why not add other features?
  λ1 log P(f|e) + λ2 log P(e) + λ3 log P(e|f) + ...
Minimum error-rate training (MERT)
Our new objective:
  λ1 log P(f|e) + λ2 log P(e) + λ3 log P(e|f) + ...
• How to set λ? Maximize the BLEU score on a dev set.
• This will help you get a higher BLEU score on test data.
• But how much do we really trust BLEU?
• We can't get anything like a gradient of the BLEU score with respect to λ, so learning is difficult, especially with many features.
• (But see Galley et al., EMNLP 2013, for some progress.)
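A minimal sketch of the idea (illustrative; real MERT uses Och's line search rather than random search). Each dev sentence has a list of candidate translations with feature vectors; candidates are scored by the weighted feature sum, and we keep the λ that maximizes corpus BLEU of the chosen candidates. corpus_bleu is an assumed, externally provided scoring function.

import random

def tune_weights(dev_candidates, references, corpus_bleu, n_features, trials=1000):
    """dev_candidates: for each dev sentence, a list of (translation, feature_vector)."""
    def decode(weights):
        # pick, for each sentence, the candidate maximizing the weighted feature sum
        return [max(cands, key=lambda c: sum(w * x for w, x in zip(weights, c[1])))[0]
                for cands in dev_candidates]
    best_weights, best_score = None, float("-inf")
    for _ in range(trials):
        weights = [random.uniform(-1, 1) for _ in range(n_features)]
        score = corpus_bleu(decode(weights), references)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights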
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
The Vauquois Triangle
Is translation easier at the syntactic level?

la empresa tiene enemigos fuertes en Europa .
the company has strong enemies in Europe .

Example parallel corpus (two sentence pairs per line):

Garcia and associates .                 the clients and the associates are enemies .
Garcia y asociados .                    los clientes y los asociados son enemigos .
Carlos Garcia has three associates .    the company has three groups .
Carlos Garcia tiene tres asociados .    la empresa tiene tres grupos .
his associates are not strong .         its groups are in Europe .
sus asociados no son fuertes .          sus grupos estan en Europa .
Garcia has a company also .             the modern groups sell strong pharmaceuticals .
Garcia tambien tiene una empresa .      los grupos modernos venden medicinas fuertes .
its clients are angry .                 the groups do not sell zanzanine .
sus clientes estan enfadados .          los grupos no venden zanzanina .
the associates are also angry .         the small groups are not modern .
los asociados tambien estan enfadados . los grupos pequenos no son modernos .

Same pattern: NN JJ → JJ NN
Finite-state models do not capture this generalization.
Let’s do an example
Context-Free Grammar

S → NP VP
NP → watashi wa
NP → hako wo
VP → NP V
V → akemasu

Derivation:
S ⇒ NP VP ⇒ watashi wa NP V ⇒ watashi wa hako wo V ⇒ watashi wa hako wo akemasu
Synchronous Context-Free Grammar

Pair up a Japanese grammar with an English grammar:

S → NP VP            S → NP VP
NP → watashi wa      NP → I
NP → hako wo         NP → the box
VP → NP V            VP → V NP
V → akemasu          V → open

Indexing the nonterminals gives one synchronous grammar:

S → NP1 VP2 / NP1 VP2
NP → watashi wa / I
NP → hako wo / the box
VP → NP1 V2 / V1 NP2
V → akemasu / open

Synchronous derivation (both trees are built in parallel):

S / S
⇒ NP1 VP2 / NP1 VP2
⇒ watashi wa VP / I VP
⇒ watashi wa NP1 V2 / I V1 NP2
⇒ watashi wa hako wo akemasu / I open the box
Translation as Parsing

Input: watashi wa hako wo akemasu

Parse the input with the source side of the synchronous grammar; the linked target-side derivation gives the translation:

watashi wa hako wo akemasu → I open the box
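A minimal sketch of translation as parsing for the toy grammar above (illustrative; a real system would use weighted rules and a packed chart). It fills a CKY-style chart over source spans and reads the English string off the target side of each rule.

NONTERMS = {"S", "NP", "VP", "V"}

# Each rule: (lhs, source RHS, target words, order of the source children on the target side)
RULES = [
    ("S",  ["NP", "VP"],      None, [0, 1]),   # S  -> NP1 VP2 / NP1 VP2
    ("VP", ["NP", "V"],       None, [1, 0]),   # VP -> NP1 V2  / V1  NP2  (swap)
    ("NP", ["watashi", "wa"], ["I"], None),
    ("NP", ["hako", "wo"],    ["the", "box"], None),
    ("V",  ["akemasu"],       ["open"], None),
]

def translate(words, goal="S"):
    n = len(words)
    chart = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
    # Lexical rules: the source side must match the span's words exactly.
    for (i, j), cell in chart.items():
        for lhs, src, tgt, _ in RULES:
            if all(s not in NONTERMS for s in src) and words[i:j] == src:
                cell[lhs] = " ".join(tgt)
    # Binary rules, smaller spans first (CKY-style).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for lhs, src, _, order in RULES:
                if len(src) != 2 or any(s not in NONTERMS for s in src):
                    continue
                for k in range(i + 1, j):
                    left, right = chart[(i, k)].get(src[0]), chart[(k, j)].get(src[1])
                    if left is not None and right is not None:
                        children = [left, right]
                        # The target side may reorder the two children.
                        chart[(i, j)][lhs] = " ".join(children[x] for x in order)
    return chart[(0, n)].get(goal)

print(translate("watashi wa hako wo akemasu".split()))  # I open the box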
Synchronous grammars for semantic parsing
Wong and Mooney (2007)
The Big Question
Where do the categories come from?

Answer #1: there are no categories!

X → X1 X2 / X1 X2      (keep order)
X → X1 X2 / X2 X1      (swap order)
X → watashi wa / I     (translate words or phrases)
X → hako wo / the box
X → akemasu / open

The same paired derivation goes through with X as the only category, yielding
watashi wa hako wo akemasu / I open the box.
Inversion Transduction Grammar

Parsing is polynomial. We must be giving up something in order to achieve polynomial complexity.

  A B C D
  B D A C

ITG cannot produce this kind of reordering.
Does this matter? Do such reorderings occur in real data?
Inversion Transduction Grammar

[Figure: word alignment between "je pense que , indépendamment de notre parti , nous trouvons tous cela inacceptable" and "that , I believe , we all find unacceptable , regardless of political party"]

ITG cannot produce this kind of reordering.
Does this matter? Do such reorderings occur in real data?
YES! (but they're very rare)
Hierarchical Phrase-Based Translation

X → X1 hako wo X2 / X1 open X2
X → hako wo / the box
X → akemasu / open

[Paired derivation trees over "watashi wa hako wo akemasu" / "I open the box", using X as the only category]
The Big Question
Where do the categories come from?
Answer #2: from a parser.
S → NP1 VP2 / NP1 VP2
NP → watashi wa / I
NP → hako wo / the box
VP → NP1 V2 / V1 NP2
V → akemasu / open
Syntax-based Translation

[Paired syntax trees for "watashi wa hako wo akemasu" / "I open the box"]

Are reorderings in real data consistent with isomorphisms on linguistic parse trees?
Syntax-based Translation

[Figure: parse tree for "I saw her duck" (S, VP, SBAR; PRP VBD PRP VB), word-aligned to "yo la vi agacharse"]

Are reorderings in real data consistent with isomorphisms on linguistic parse trees?
Of course not.
Syntax-based Translation

[Figure: the same tree, treated as a tree substitution grammar and flattened to a weakly equivalent SCFG]

S → PRP1 VP2 / PRP1 VP2
PRP → I / yo
VP → VBD1 her VB2 / la VBD1 VB2
VBD → saw / vi
VB → duck / agacharse

Problem: we need a parser!
The Big Question
Where do the categories come from?
Answer #3: they are automatically induced!
This is an area of active research.
[Link]/workshops/ws10/groups/msgismt/
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
More has been written about machine translation evaluation than about machine translation itself.
– Yorick Wilks
Human evaluation
Have annotators read and assess translations:
• adequacy: "e covers the same content as f"
• fluency: "e looks like English"
Problems with human evaluation
• People hate evaluating translation, especially bad translation.
• People don't tend to agree with each other.
You try it:
• A: furious nAgA on wednesday , the tribal minimum pur of ten schools also was burnt
• B: furious nAgA on wednesday the tribal pur mini ten schools of them was also burnt
Automatic evaluation
Automatic evaluation of MT is hard:
• There are many correct ways to translate something.
• Measuring fluency is an open problem.
• Measuring adequacy is even harder (AI-complete?)
That said...
• Computers don't mind the work.
• Rapid evaluation supports iterative development.
• Evaluation can use fancier models than translation itself, since only a few hypotheses must be considered.
BLEU
• Most widespread automatic evaluation statistic by far
• Ranges from 0.0 (worst) to 1.0 (best)
• Computes n-gram overlap of a hypothesis with one or more references
• Weighted average of precisions
• "Brevity penalty" that kicks in if the hypothesis translation is too short
Computing BLEU
• n-gram overlap is computed on a per-segment basis
• a reference length r is determined for each segment: what is the closest length in the set of references?
• Statistics are aggregated over the corpus

  BLEU_4 = e^{min(1 − r/|h|, 0)} × Π_{n=1}^{4} prec(n)^{1/4}
hypothesis: ' extension of isi in uttar pradesh '
ref 1: ' isi 's expansion in uttar pradesh '
ref 2: ' the spread of isi in uttar pradesh '
ref 3: ' isi spreading in uttar pradesh '
ref 4: the spread of isi in uttar pradesh

prec(1) = (1+1+1+1+1+1+1) / (1+1+1+1+1+1+1+1) = 7/8 = 0.875
prec(2) = (1+1+1+1+1) / (1+1+1+1+1+1+1) = 5/7 = 0.714
prec(3) = 0.666
prec(4) = 0.6
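A minimal BLEU sketch (illustrative; the official script adds tokenization details and corpus-level aggregation of the clipped counts). Run on the example above, it reproduces prec(1) = 0.875 and prec(2) ≈ 0.714.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(hyp, refs, n):
    """Modified n-gram precision: clip counts by the max count in any reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / max(1, sum(hyp_counts.values()))

def bleu(hyp, refs, max_n=4):
    precs = [precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precs) == 0:
        return 0.0
    r = min((abs(len(ref) - len(hyp)), len(ref)) for ref in refs)[1]  # closest ref length
    brevity = math.exp(min(0.0, 1.0 - r / len(hyp)))
    return brevity * math.exp(sum(math.log(p) for p in precs) / max_n)

hyp = "' extension of isi in uttar pradesh '".split()
refs = [r.split() for r in ["' isi 's expansion in uttar pradesh '",
                            "' the spread of isi in uttar pradesh '",
                            "' isi spreading in uttar pradesh '",
                            "the spread of isi in uttar pradesh"]]
print(precision(hyp, refs, 1), precision(hyp, refs, 2), bleu(hyp, refs))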
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
What you need to do MT
• Parallel corpus
• Word alignment
• Language modeling
• Decoder
Parallel corpora
• The Linguistic Data Consortium will sell you:
  – United Nations data
  – Canadian Hansards
  – Hong Kong laws parallel text
  – Parallel newswire
  – [Link]
• Or, you can download EuroParl: [Link]
Word alignment
• Giza++ is an open-source implementation of the IBM models
• [Link]
Language modeling
SRILM (the SRI Language Modeling toolkit)
• Developed for speech recognition, but works for MT
• All kinds of fancy smoothing algorithms
• [Link]
Decoder
• cdec ([Link])
• Moses ([Link])
  – Contains code for the entire MT pipeline, including decoding.
  – Decent-looking documentation
• Doing MT for a course project might be ambitious, but you might be able to experiment with one of the components and leave the rest as black boxes.
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
What can you do without much parallel data?
Munteanu and Marcu (2005), "Improving Machine Translation Performance by Exploiting Non-Parallel Corpora"
• suppose we only have a few aligned sentences
• but lots of possibly-aligned documents
• idea: use the parallel corpus to train a system to find more parallel documents and sentences.
Wikipedia as a parallel corpus
Smith, Quirk, and Toutanova, "Extracting Parallel Sentences from Comparable Corpora Using Document Level Alignment"
• Thanks to Wikipedia, aligned documents are easy to obtain for many language pairs.
• Similar idea to M&M: train a model (a CRF) to identify aligned sentences, and add these to the MT system.
• Note: you still need some sentence-aligned data to train your initial model.
Wikipedia as a parallel corpus
For "medium"-sized corpora, adding Wikipedia sentences helps.
If you're evaluating on a test set of Wikipedia, it's helpful to add Wikipedia to your training set.
What can you do without ANY parallel data?
Haghighi et al., "Learning Bilingual Lexicons from Monolingual Corpora"
• Latent space models capture word similarity by factoring a matrix of local context counts, f_i ≈ W z_i
  – f_i is the feature vector for word i (e.g., context counts)
  – z_i is the latent representation for word i
  – W is some projection matrix
• Key idea: apply this to two languages at once, and automatically discover a translation lexicon.
  – Aligned words should have identical z_i.
  – f_S(s_i) ≈ W_S z_i
  – f_T(t_i) ≈ W_T z_i
Learning the lexicon
• Start with a seed set of 100 words
• Viterbi (hard) EM. Iterate:
  – Compute W_S, W_T, z_i for all i
  – Find the best alignment (bipartite mapping)
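Here is a rough sketch in the same spirit (not the authors' exact model, which uses probabilistic CCA with hard EM): learn linear projections from a seed dictionary with CCA, project every word's context-count vector into the shared latent space, and take the best bipartite matching as the lexicon. src_vectors and tgt_vectors are assumed to be precomputed context-count matrices.

import numpy as np
from sklearn.cross_decomposition import CCA
from scipy.optimize import linear_sum_assignment

def induce_lexicon(src_vectors, tgt_vectors, src_words, tgt_words, seed_pairs, dim=50):
    """seed_pairs: list of (source_word, target_word) translations used for supervision.
    dim should not exceed the number of seed pairs or the feature dimensionality."""
    s_idx = {w: i for i, w in enumerate(src_words)}
    t_idx = {w: i for i, w in enumerate(tgt_words)}
    # Fit projections so that seed translations land near each other in the latent space.
    X = np.array([src_vectors[s_idx[s]] for s, _ in seed_pairs])
    Y = np.array([tgt_vectors[t_idx[t]] for _, t in seed_pairs])
    cca = CCA(n_components=dim).fit(X, Y)
    Z_s, Z_t = cca.transform(src_vectors, tgt_vectors)
    # Cosine similarity between every source and target word in the latent space.
    Z_s /= np.linalg.norm(Z_s, axis=1, keepdims=True)
    Z_t /= np.linalg.norm(Z_t, axis=1, keepdims=True)
    sim = Z_s @ Z_t.T
    # Best one-to-one alignment (bipartite matching) maximizing total similarity.
    rows, cols = linear_sum_assignment(-sim)
    return [(src_words[i], tgt_words[j]) for i, j in zip(rows, cols)]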
Results
• The approach works well for English and French:
• Not so well for Chinese and Arabic.
• Having a good seed helps
  (but inducing seeds from edit distance works ok).
• Having similar corpora (Wikipedia vs. Gigaword) helps.
Recap
• Key pieces for machine translation:
  – Alignment: word-to-word, phrase-to-phrase, or syntactic.
  – Estimation: MLE for the noisy channel, MERT for component weights.
  – Decoding: reordering makes beam search crucial.
  – Evaluation: BLEU counts n-gram overlap.
• Syntactic machine translation is mainly based on synchronous context-free grammars.
• Translation without parallel text:
  – Find parallel documents on the web.
  – Use Wikipedia; learn to classify parallel sentences.
  – Use latent space models to build bilingual lexicons.