CS 4650/7650
Machine Translation 1
Jacob Eisenstein
November 30, 2015
(with slides from David Chiang, Chris Dyer, and Philipp Koehn)
Overview of machine translation
Why is MT hard?
Word order
• English: subject-verb-object
• Japanese: subject-object-verb
• Examples
  – English: IBM bought Lotus
    Japanese: IBM Lotus bought
• French: subject-verb-object... except for pronouns
  – English: I will buy it
    French: Je vais l'acheter (I will it buy)
  – English: I bought it
    French: Je l'ai acheté (I it have bought)
• How many orderings are there?
Word sense ambiguity
• We've already talked about how bill translates as pico (bird) or cuenta (cost).
• Legs and feet in English and French:
• Lexical gaps are when a language lacks a word for a concept.
  – My favorite English lexical gap: schadenfreude, a German word for "pleasure taken from the misfortune of others."
Pronouns
Pronoun morphology can convey different information in each language:
• English possessive pronouns take the gender of the owner:
  Marie rides her bike
• French possessive pronouns take the gender of the object:
  Marie monte sur son vélo
• Translating into English requires:
  – Anaphora resolution: son to Marie.
  – Gender determination: Marie is female.
• Google Translate's treatment of this one is interesting.
  Compare Marie monte son vélo versus Marie vend son vélo.
Pronouns
Many languages (Spanish, Japanese, Chinese, Turkish, Arabic, etc.) can drop pronouns.
• In Spanish, you can recover the pronoun from verb inflection:
  Vivimos en Atlanta → We live in Atlanta
• Again, discourse context is often crucial:
  Vive en Atlanta → She/he/it lives in Atlanta
Chinese example:
Tense and case
• Spanish has two past tenses.
  – The preterite tense is for events with a definite time, e.g.,
    I biked to work this morning
  – The imperfect is for events with indefinite times, e.g.,
    I biked to work all last summer
• To translate English to Spanish, we must pick the right tense.
  This seems to require some deeper semantic understanding.
• Many languages have richer case morphology than English;
  translating to these languages requires identifying whether the
  word should be in the accusative, dative, etc.
Idioms
• Why in the world
• Kick the bucket
• Lend me your ears
• ...
The Vauquois Triangle
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
The noisy channel model
• Remember the noisy channel model?
• This is a general framework for thinking about translation (and many other NLP problems).
(next few slides from Chris Dyer)
One naturally wonders if the problem
of translation could conceivably be
treated as a problem in cryptography.
When I look at an article in Russian, I
say: ‘This is really written in
English, but it has been coded in
some strange symbols. I will now
proceed to decode.’
Warren Weaver to Norbert Wiener, March, 1947
[Figure: the noisy channel. A sent transmission Y passes through a "noisy" channel to give the received transmission X; a decoder recovers the message Ŷ. The source is modeled by p(y), the channel by p(x|y).]

ŷ = argmax_y p(y|x)
  = argmax_y p(x|y) p(y) / p(x)
  = argmax_y p(x|y) p(y)      (the denominator doesn't depend on y)

For translation, English is the sent transmission and "French" is the received transmission:

ê = argmax_e p(f|e) p(e)

where p(f|e) is the translation model and p(e) is the language model.

Other noisy channel applications: OCR, speech recognition, spelling correction...
Division of labor
• Translation model
  – probability of translation back into the source
  – ensures adequacy of translation
• Language model
  – is a translation hypothesis "good" English?
  – ensures fluency of translation
[Figure: the complete pipeline. A learner estimates the language model p(e) from English text; another learner estimates the translation model p(f|e) from English-français parallel text. Given French input (S'il vous plaît traduire...), the decoder computes ê = argmax_e p(e|f) and outputs English (Please translate...).]
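To make the division of labor concrete, here is a minimal sketch (not from the slides) of noisy-channel scoring: given a few candidate translations with hypothetical translation-model and language-model log-probabilities, the decoder just picks the candidate maximizing log p(f|e) + log p(e). The candidate list and scores are illustrative placeholders.

def noisy_channel_decode(candidates, tm_logprob, lm_logprob):
    """Pick the English candidate e maximizing log p(f|e) + log p(e)."""
    return max(candidates, key=lambda e: tm_logprob(e) + lm_logprob(e))

# Hypothetical scores for three candidate translations of one French sentence.
tm = {"please translate": -2.1, "please to translate": -1.8, "translate please": -2.0}
lm = {"please translate": -4.0, "please to translate": -9.5, "translate please": -6.3}

best = noisy_channel_decode(tm.keys(), tm.get, lm.get)
print(best)  # "please translate": adequate (good TM score) and fluent (good LM score)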
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Estimating the components
• We've already learned how to estimate language models.
  – Google uses n-grams of order 5-7.
  – Smoothing is really important; so is being clever about decoding.
• Estimating the translation model is more difficult.
  – e = And the program was implemented
  – f = La programmation a été mise en application
  – P(f|e) is really hard to define.
  – Easier: something like P(la|the) × P(programmation|program) × ...
  – But we are given aligned sentences, not aligned words.
Alignment
• In a parallel text (or when we translate), we align words in one language with the words in the other:

  1    2     3   4
  das  Haus  ist klein
  the  house is  small
  1    2     3   4

• Word positions are numbered 1-4
Alignment function
• Formalizing alignment with an alignment function
• Mapping an English target word at position i to a German source word at position j with a function a : i → j
• Example: a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
Reordering
• Words may be reordered during translation

  1     2    3   4
  klein ist  das Haus
  the   house is small
  1     2     3  4

  a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}
One-to-many translation
• A source word may translate into multiple target words

  1   2    3   4
  das Haus ist klitzeklein
  the house is very small
  1   2     3  4    5

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}
Dropping words
• Words may be dropped when translated
  – The German article das is dropped

  1   2    3   4
  das Haus ist klein
      house is small
      1     2  3

  a : {1 → 2, 2 → 3, 3 → 4}
Inserting words
• Words may be added during translation
  – The English just does not have an equivalent in German
  – We still need to map it to something: a special NULL token

  0    1   2    3   4
  NULL das Haus ist klein
  the  house is just small
  1    2     3  4    5

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}
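These alignment functions are easy to represent directly. A small sketch (illustrative, not from the slides): the last example as a Python dict mapping target positions to source positions, with 0 standing for NULL, printing the induced word pairs.

source = ["NULL", "das", "Haus", "ist", "klein"]   # position 0 is the NULL token
target = ["the", "house", "is", "just", "small"]

# a : target position -> source position (1-indexed targets, 0 = NULL source)
a = {1: 1, 2: 2, 3: 3, 4: 0, 5: 4}

for i, j in sorted(a.items()):
    print(f"{target[i - 1]} <- {source[j]}")
# the <- das, house <- Haus, is <- ist, just <- NULL, small <- klein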
Translation with alignments
Let's add the alignments as a variable into the probability model:

  P(f_1 ... f_m | e_1 ... e_ℓ) = Σ_a P(f_1 ... f_m, a_1 ... a_m | e_1 ... e_ℓ)
                               = ...
                               = Π_i Σ_{a_i} q(a_i | i, ℓ, m) t(f_i | e_{a_i})

• t(f_i | e_{a_i}) is the translation probability
• q(a_i | i, ℓ, m) is the alignment probability

Independence assumptions:
• Word translations are independent given the alignments.
• Alignments are independent of each other.
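As a sanity check on the factorization, a minimal sketch (not from the slides) that computes P(f|e) by marginalizing the alignment of each French word independently; t is a toy lookup table, and q defaults to the uniform 1/ℓ choice of IBM Model 1.

def sentence_prob(f_words, e_words, t, q=None):
    """P(f|e) = prod_i sum_j q(j|i,l,m) * t(f_i|e_j), alignments marginalized per word."""
    ell = len(e_words)
    prob = 1.0
    for i, f in enumerate(f_words, start=1):
        total = 0.0
        for j, e in enumerate(e_words, start=1):
            align = q(j, i, ell, len(f_words)) if q else 1.0 / ell  # Model 1: uniform
            total += align * t.get((f, e), 0.0)
        prob *= total
    return prob

# Toy translation table t(f|e) for a two-word example.
t = {("la", "the"): 0.4, ("maison", "the"): 0.1,
     ("la", "house"): 0.1, ("maison", "house"): 0.6}
print(sentence_prob(["la", "maison"], ["the", "house"], t))  # 0.25 * 0.35 = 0.0875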
Example
And the program has been implemented
La programmation a été mise en application

P(f, a|e) = q(2|1, 6, 7) × t(La|the)
          × q(3|2, 6, 7) × t(programmation|program)
          × q(4|3, 6, 7) × t(a|has)
          × q(5|4, 6, 7) × t(été|been)
          × q(6|5, 6, 7) × t(mise|implemented)
          × q(6|6, 6, 7) × t(en|implemented)
          × q(6|7, 6, 7) × t(application|implemented)
IBM alignment models
• IBM Model 1: q(j|i, ℓ, m) = 1/ℓ
  (All alignments are equally likely.)
Estimation and alignment
• To translate, we need t(f|e), the translation probabilities.
• If we knew the alignments, estimation would be easy:

  t(programme|program) = count-of-programme-aligned-to-program / count-of-program
                       = Σ_{⟨f,e⟩ ∈ examples} Σ_i δ(f_i = programme, e_{a_i} = program)
                         / Σ_{⟨f,e⟩ ∈ examples} Σ_j δ(e_j = program)

• Conversely, getting the alignments would be easy if we knew the translation probabilities:

  P(a|e, f) = P(f|e, a) P(a) / Σ_{a'} P(f|e, a') P(a')

• How to solve a chicken-and-egg problem? EM!
Example
• Translation probabilities

             the    house
  la         0.4    0.1
  maison     0.1    0.6

• Alignments

                                     P(f, a|e)            P(a|f, e)
  the → la, house → la               0.4 × 0.1 = 0.04      0.11
  the → la, house → maison           0.4 × 0.6 = 0.24      0.69
  the → maison, house → la           0.1 × 0.1 = 0.01      0.03
  the → maison, house → maison       0.1 × 0.6 = 0.06      0.17

• Counts

             the                      house
  la         0.11 + 0.69 = 0.80       0.11 + 0.03 = 0.14
  maison     0.03 + 0.17 = 0.20       0.69 + 0.17 = 0.86

Then we add up the counts across all the examples and update the translation probabilities (a small sketch of this E-step follows below).
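A minimal sketch (illustrative, not from the slides) of this E-step on the single sentence pair above: enumerate the four alignments, compute their posteriors, and accumulate the expected counts.

from itertools import product

t = {("la", "the"): 0.4, ("la", "house"): 0.1,
     ("maison", "the"): 0.1, ("maison", "house"): 0.6}

e_words = ["the", "house"]
f_words = ["la", "maison"]

def joint_prob(a):
    """P(f, a|e): each English word independently picks its French translation."""
    p = 1.0
    for e, f in zip(e_words, a):
        p *= t[(f, e)]
    return p

alignments = list(product(f_words, repeat=len(e_words)))
joint = {a: joint_prob(a) for a in alignments}
z = sum(joint.values())

counts = {}
for a, p in joint.items():
    posterior = p / z                      # P(a|f, e)
    for e, f in zip(e_words, a):
        counts[(f, e)] = counts.get((f, e), 0.0) + posterior

print(counts)  # expected counts, e.g. counts[("la", "the")] is about 0.80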
EM algorithm

... la maison ... la maison bleu ... la fleur ...
... the house ... the blue house ... the flower ...

• Initial step: all alignments equally likely
• The model learns that, e.g., la is often aligned with the
• After one iteration: alignments, e.g., between la and the are more likely
• After another iteration: it becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)
• Convergence: the inherent hidden structure is revealed by EM
• Parameter estimation from the aligned corpus:

  p(la|the) = 0.453
  p(le|the) = 0.334
  p(maison|house) = 0.876
  p(bleu|blue) = 0.563
  ...
IBM Model 1 and EM
• The EM algorithm consists of two steps
• Expectation step: apply the model to the data
  – parts of the model are hidden (here: alignments)
  – using the model, assign probabilities to possible values
• Maximization step: estimate the model from the data
  – take the assigned values as fact
  – collect counts (weighted by probabilities)
  – estimate the model from the counts
• Iterate these steps until convergence
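Putting the two steps together, here is a compact sketch of EM for IBM Model 1 translation probabilities (uniform alignment prior, no NULL word); it is illustrative rather than a faithful reimplementation of GIZA++.

from collections import defaultdict

def model1_em(bitext, iterations=10):
    """bitext: list of (french_words, english_words) pairs."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # initialize t(f|e) uniformly
    for _ in range(iterations):
        count = defaultdict(float)   # expected count(f, e)
        total = defaultdict(float)   # expected count(e)
        for fs, es in bitext:
            for f in fs:
                # E-step: posterior over which English word generated f
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate t(f|e) from the expected counts
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

bitext = [("la maison".split(), "the house".split()),
          ("la maison bleu".split(), "the blue house".split()),
          ("la fleur".split(), "the flower".split())]
t = model1_em(bitext)
print(round(t[("la", "the")], 3), round(t[("fleur", "flower")], 3))
# e.g. t(la|the) and t(fleur|flower) end up with most of the probability mass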
IBM Model 2
Add a prior probability on alignments, q(a_i | i, m, ℓ).
• We estimate q along with t during the M-step.
• The joint probability still decomposes across words:
  P(e, a|f) = Π_i t(e_i | f_{a_i}) q(a_i | i, ℓ_e, ℓ_f)
• But adding this parameter makes the likelihood non-convex.
  – This means initialization affects the outcome.
  – Initializing from IBM Model 1 works well in practice.
IBM Model 2

  natürlich ist das haus klein
    ↓ lexical translation step
  of course is the house small
    ↓ alignment step
  of course the house is small
IBM Model 3

  Mary did not slap the green witch
    ↓ n(3|slap)
  Mary not slap slap slap the green witch
    ↓ p-null
  Mary not slap slap slap NULL the green witch
    ↓ t(la|the)
  Maria no daba una bofetada a la verde bruja
    ↓ d(4|4)
  Maria no daba una bofetada a la bruja verde
IBM alignment models
• IBM Model 1: q(j|i, ℓ, m) = 1/ℓ
  (All alignments are equally likely.)
• IBM Model 2: q(j|i, ℓ, m) = c(j|i, ℓ, m) / c(i, ℓ, m)
  (Alignment probability is a parameter of the model.)
• IBM Model 3
  (model "fertility", the number of alignments per word)
• IBM Models 4 and 5
  – Condition on the alignment of the preceding word
  – Like an HMM: P(a_i | a_{i−1}, ℓ, m)
There are lots of papers on alignment alone, but Fraser and Marcu (2007) note that improvements in alignment accuracy may not improve overall translation.
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Problems with word-based translation
Word-based translation has obvious limitations:
• Multi-word alignments aren't modeled well:
  P(daba una bofetada|slap) =? P(daba|slap) P(una|slap) P(bofetada|slap)
• Many phrasal translations are non-compositional:
  faire (make) le (the) ménage (home) → clean up
• Alignment decisions for phrasal units should be made jointly:
  la comida me gusta mucho → i like the food a lot

Two solutions: phrases and syntax
Phrase-Based Model
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
Phrase Translation Table
• Main knowledge source: table with phrase translations and their probabilities
• Example: phrase translations for natürlich

  Translation        Probability φ(ē|f̄)
  of course          0.5
  naturally          0.3
  of course ,        0.15
  , of course ,      0.05
Linguistic Phrases?
• Model is not limited to linguistic phrases
  (noun phrases, verb phrases, prepositional phrases, ...)
• Example non-linguistic phrase pair:
  spass am → fun with the
• Prior noun often helps with translation of the preposition
• Experiments show that limitation to linguistic phrases hurts quality
Learning a Phrase Translation Table
• Task: learn the model from a parallel corpus
• Three stages:
  – word alignment: using IBM models or other method
  – extraction of phrase pairs
  – scoring phrase pairs
Phrase-based translation
Typically, we start with a symmetrized set of word alignments:
• Align e to f
• Align f to e (not generally the same!)
• Take the intersection
Phrase Extraction Criteria

[Figure: three alignment grids for "Maria no daba" / "Mary did not slap", labeled consistent, inconsistent, inconsistent]

• The phrase alignment has to contain all alignment points for all covered words
• The phrase alignment has to contain at least one alignment point
Word alignment induced phrases

[Figure: word alignment grid for "Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch"]

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the),
(daba una bofetada a la bruja verde, slap the green witch),
(no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)
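A minimal sketch of consistent phrase-pair extraction (illustrative; real toolkits such as Moses also handle unaligned words and limit phrase length). It takes a sentence pair and a set of word-alignment points and enumerates all phrase pairs whose alignment points stay inside the box.

def extract_phrases(src, tgt, alignment):
    """alignment: set of (source_index, target_index) pairs, 0-indexed."""
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, len(src)):
            # target positions linked to the source span [s1, s2]
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue            # must contain at least one alignment point
            t1, t2 = min(linked), max(linked)
            # consistency: no word inside the target span may align outside the source span
            if any(t1 <= j <= t2 and not s1 <= i <= s2 for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "Maria no daba una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
for f, e in extract_phrases(src, tgt, alignment):
    print(f, "|||", e)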
Scoring Phrase Translations
• Phrase pair extraction: collect all phrase pairs from the data
• Phrase pair scoring: assign probabilities to phrase translations
• Score by relative frequency:

  φ(f̄|ē) = count(ē, f̄) / Σ_{f̄_i} count(ē, f̄_i)
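The relative-frequency scoring itself is only a couple of lines; a small illustrative sketch that counts extracted phrase pairs over a corpus and normalizes per English phrase:

from collections import Counter, defaultdict

def score_phrase_table(pairs):
    """pairs: list of (foreign_phrase, english_phrase) tuples extracted from the corpus."""
    pair_count = Counter(pairs)
    e_count = Counter(e for _, e in pairs)
    table = defaultdict(dict)          # table[e][f] = phi(f|e)
    for (f, e), c in pair_count.items():
        table[e][f] = c / e_count[e]
    return table

pairs = [("natürlich", "of course"), ("natürlich ,", "of course"), ("natürlich", "naturally")]
print(score_phrase_table(pairs)["of course"])   # {'natürlich': 0.5, 'natürlich ,': 0.5}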
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Decoding Process

Maria no dio una bofetada a la bruja verde

• Build the translation left to right:
  – select foreign words to be translated
  – find an English phrase translation
  – add the English phrase to the end of the partial translation
  – mark the foreign words as translated
• The hypothesis grows one phrase at a time:
  Mary → Mary did not (one-to-many translation) → Mary did not slap (many-to-one translation)
  → Mary did not slap the → Mary did not slap the green (reordering)
  → Mary did not slap the green witch (translation finished)
Translation Options

[Table: phrase translation options for spans of the source sentence]
Maria no dio una bofetada a la bruja verde
Mary not give a slap to the witch green
did not a slap by green witch
no slap to the
did not give to
the
slap the witch

• Look up possible phrase translations
  – many different ways to segment words into phrases
  – many different ways to translate each phrase
Hypothesis Expansion
• Start with the empty hypothesis
  – e: no English words
  – f: no foreign words covered (---------)
  – p: probability 1
• Pick a translation option and create a new hypothesis
  – e: add the English phrase Mary
  – f: first foreign word covered (*--------)
  – p: probability 0.534
A Quick Word on Probabilities
• Not going into detail here, but...
• Translation Model
  – phrase translation probability p(Mary|Maria)
  – reordering costs
  – phrase/word count costs
  – ...
• Language Model
  – uses trigrams:
    p(Mary did not) = p(Mary|START) × p(did|Mary, START) × p(not|Mary did)
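For concreteness, a tiny sketch of that trigram computation in log space (the probabilities are hypothetical placeholders):

import math

# Hypothetical trigram log-probabilities (base e).
logprob = {("<s>", "<s>", "Mary"): -2.3,
           ("<s>", "Mary", "did"): -1.6,
           ("Mary", "did", "not"): -0.9}

def trigram_logprob(words):
    """log p(words) under a trigram model with start padding."""
    history = ["<s>", "<s>"]
    total = 0.0
    for w in words:
        total += logprob[(history[0], history[1], w)]
        history = [history[1], w]
    return total

print(math.exp(trigram_logprob(["Mary", "did", "not"])))  # p(Mary did not)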
Hypothesis Expansion
• Add another hypothesis
  – e: witch, f: -------*-, p: .182
• Further hypothesis expansion
  – e: ... slap, f: *-***----, p: .043
• ... until all foreign words are covered, e.g.:
  e: (empty) → Mary → did not → slap → the → green witch
  f: --------- → *-------- → **------- → *****---- → *******-- → *********
  p: 1 → .534 → .154 → .015 → .004283 → .000271
  – find the best hypothesis that covers all foreign words
  – backtrack to read off the translation
• Adding more hypotheses ⇒ explosion of search space
Explosion of Search Space
• The number of hypotheses is exponential with respect to sentence length
⇒ Decoding is NP-complete [Knight, 1999]
⇒ Need to reduce the search space
  – risk free: hypothesis recombination
  – risky: histogram/threshold pruning
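To make the search concrete, here is a heavily simplified stack-decoding sketch (illustrative only; real decoders add distortion limits, future-cost estimates, hypothesis recombination, and proper language-model context). Hypotheses are grouped into stacks by the number of covered source words, each stack is pruned to a fixed beam size, and the phrase table and bigram language model below are toy placeholders.

import heapq, math

def beam_decode(src, phrase_table, lm_bigrams, beam_size=10):
    """Simplified stack decoding over hypotheses (score, coverage, last word, output)."""
    n = len(src)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, frozenset(), "<s>", ""))
    for covered in range(n):
        # histogram pruning: only expand the best hypotheses in this stack
        for score, cov, last, out in heapq.nlargest(beam_size, stacks[covered]):
            for i in range(n):
                for j in range(i + 1, n + 1):
                    if any(k in cov for k in range(i, j)):
                        continue  # span overlaps already-translated words
                    for english, tm_logp in phrase_table.get(" ".join(src[i:j]), []):
                        lm_logp, new_last = 0.0, last
                        for w in english.split():
                            lm_logp += math.log(lm_bigrams.get((new_last, w), 0.05))
                            new_last = w
                        hyp = (score + tm_logp + lm_logp, cov | set(range(i, j)),
                               new_last, (out + " " + english).strip())
                        stacks[len(hyp[1])].append(hyp)
    return max(stacks[n])[3] if stacks[n] else None

phrase_table = {"Maria": [("Mary", -0.1)],
                "no": [("not", -0.7), ("did not", -0.4)],
                "dio una bofetada": [("slap", -0.5)],
                "a la": [("the", -0.3)],
                "bruja verde": [("green witch", -0.2)]}
lm_bigrams = {("<s>", "Mary"): 0.5, ("Mary", "did"): 0.5, ("did", "not"): 0.5,
              ("not", "slap"): 0.5, ("slap", "the"): 0.5, ("the", "green"): 0.5,
              ("green", "witch"): 0.5}
src = "Maria no dio una bofetada a la bruja verde".split()
print(beam_decode(src, phrase_table, lm_bigrams))  # Mary did not slap the green witch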
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
Freedom from the rusty manacles of probability
• In the noisy channel model, decoding maximizes:
  log P(f|e) + log P(e)
• We might decide these components are not equally important:
  λ1 log P(f|e) + λ2 log P(e)
• But this is just a log-linear model. Why not add other features?
  λ1 log P(f|e) + λ2 log P(e) + λ3 log P(e|f) + ...
Minimum error-rate training (MERT)
Our new objective:
  λ1 log P(f|e) + λ2 log P(e) + λ3 log P(e|f) + ...
• How to set λ? Maximize the BLEU score on a dev set.
• This will help you get a higher BLEU score on test data.
• But how much do we really trust BLEU?
• We can't get anything like a gradient of the BLEU score with respect to λ, so learning is difficult, especially with many features.
• (But see Galley et al., EMNLP 2013, for some progress.)
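A minimal sketch of the idea (illustrative; real MERT uses Och's line search rather than random search). Each dev sentence has a list of candidate translations with feature vectors; candidates are scored by the weighted feature sum, and we keep the λ that maximizes corpus BLEU of the chosen candidates. corpus_bleu is an assumed, externally provided scoring function.

import random

def tune_weights(dev_candidates, references, corpus_bleu, n_features, trials=1000):
    """dev_candidates: for each dev sentence, a list of (translation, feature_vector)."""
    def decode(weights):
        # pick, for each sentence, the candidate maximizing the weighted feature sum
        return [max(cands, key=lambda c: sum(w * x for w, x in zip(weights, c[1])))[0]
                for cands in dev_candidates]
    best_weights, best_score = None, float("-inf")
    for _ in range(trials):
        weights = [random.uniform(-1, 1) for _ in range(n_features)]
        score = corpus_bleu(decode(weights), references)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights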
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
The Vauquois Triangle
Is translation easier at the syntactic level?

la empresa tiene enemigos fuertes en Europa .
the company has strong enemies in Europe .

Example parallel corpus (two sentence pairs per line):

Garcia and associates .                 the clients and the associates are enemies .
Garcia y asociados .                    los clientes y los asociados son enemigos .
Carlos Garcia has three associates .    the company has three groups .
Carlos Garcia tiene tres asociados .    la empresa tiene tres grupos .
his associates are not strong .         its groups are in Europe .
sus asociados no son fuertes .          sus grupos estan en Europa .
Garcia has a company also .             the modern groups sell strong pharmaceuticals .
Garcia tambien tiene una empresa .      los grupos modernos venden medicinas fuertes .
its clients are angry .                 the groups do not sell zanzanine .
sus clientes estan enfadados .          los grupos no venden zanzanina .
the associates are also angry .         the small groups are not modern .
los asociados tambien estan enfadados . los grupos pequenos no son modernos .

Same pattern: NN JJ → JJ NN
Finite-state models do not capture this generalization.
Let’s do an example
Context-Free Grammar

S → NP VP
NP → watashi wa
NP → hako wo
VP → NP V
V → akemasu

Derivation:
S ⇒ NP VP ⇒ watashi wa NP V ⇒ watashi wa hako wo V ⇒ watashi wa hako wo akemasu
Synchronous Context-Free Grammar

Pair up a Japanese grammar with an English grammar:

S → NP VP            S → NP VP
NP → watashi wa      NP → I
NP → hako wo         NP → the box
VP → NP V            VP → V NP
V → akemasu          V → open

Indexing the nonterminals gives one synchronous grammar:

S → NP1 VP2 / NP1 VP2
NP → watashi wa / I
NP → hako wo / the box
VP → NP1 V2 / V1 NP2
V → akemasu / open

Synchronous derivation (both trees are built in parallel):

S / S
⇒ NP1 VP2 / NP1 VP2
⇒ watashi wa VP / I VP
⇒ watashi wa NP1 V2 / I V1 NP2
⇒ watashi wa hako wo akemasu / I open the box
Translation as Parsing

Input: watashi wa hako wo akemasu

Parse the input with the source side of the synchronous grammar; the linked target-side derivation gives the translation:

watashi wa hako wo akemasu → I open the box
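A minimal sketch of translation as parsing for the toy grammar above (illustrative; a real system would use weighted rules and a packed chart). It fills a CKY-style chart over source spans and reads the English string off the target side of each rule.

NONTERMS = {"S", "NP", "VP", "V"}

# Each rule: (lhs, source RHS, target words, order of the source children on the target side)
RULES = [
    ("S",  ["NP", "VP"],      None, [0, 1]),   # S  -> NP1 VP2 / NP1 VP2
    ("VP", ["NP", "V"],       None, [1, 0]),   # VP -> NP1 V2  / V1  NP2  (swap)
    ("NP", ["watashi", "wa"], ["I"], None),
    ("NP", ["hako", "wo"],    ["the", "box"], None),
    ("V",  ["akemasu"],       ["open"], None),
]

def translate(words, goal="S"):
    n = len(words)
    chart = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
    # Lexical rules: the source side must match the span's words exactly.
    for (i, j), cell in chart.items():
        for lhs, src, tgt, _ in RULES:
            if all(s not in NONTERMS for s in src) and words[i:j] == src:
                cell[lhs] = " ".join(tgt)
    # Binary rules, smaller spans first (CKY-style).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for lhs, src, _, order in RULES:
                if len(src) != 2 or any(s not in NONTERMS for s in src):
                    continue
                for k in range(i + 1, j):
                    left, right = chart[(i, k)].get(src[0]), chart[(k, j)].get(src[1])
                    if left is not None and right is not None:
                        children = [left, right]
                        # The target side may reorder the two children.
                        chart[(i, j)][lhs] = " ".join(children[x] for x in order)
    return chart[(0, n)].get(goal)

print(translate("watashi wa hako wo akemasu".split()))  # I open the box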
Synchronous grammars for semantic parsing
Wong and Mooney (2007)
The Big Question
Where do the categories come from?

Answer #1: there are no categories!

X → X1 X2 / X1 X2      (keep order)
X → X1 X2 / X2 X1      (swap order)
X → watashi wa / I     (translate words or phrases)
X → hako wo / the box
X → akemasu / open

The same paired derivation goes through with X as the only category, yielding
watashi wa hako wo akemasu / I open the box.
Inversion Transduction Grammar

Parsing is polynomial. We must be giving up something in order to achieve polynomial complexity.

  A B C D
  B D A C

ITG cannot produce this kind of reordering.
Does this matter? Do such reorderings occur in real data?
Inversion Transduction Grammar

[Figure: word alignment between "je pense que , indépendamment de notre parti , nous trouvons tous cela inacceptable" and "that , I believe , we all find unacceptable , regardless of political party"]

ITG cannot produce this kind of reordering.
Does this matter? Do such reorderings occur in real data?
YES! (but they're very rare)
Hierarchical Phrase-Based Translation

X → X1 hako wo X2 / X1 open X2
X → hako wo / the box
X → akemasu / open

[Paired derivation trees over "watashi wa hako wo akemasu" / "I open the box", using X as the only category]
The Big Question
Where do the categories come from?
Answer #2: from a parser.
S → NP1 VP2 / NP1 VP2
NP → watashi wa / I
NP → hako wo / the box
VP → NP1 V2 / V1 NP2
V → akemasu / open
Syntax-based Translation

[Paired syntax trees for "watashi wa hako wo akemasu" / "I open the box"]

Are reorderings in real data consistent with isomorphisms on linguistic parse trees?
Syntax-based Translation

[Figure: parse tree for "I saw her duck" (S, VP, SBAR; PRP VBD PRP VB), word-aligned to "yo la vi agacharse"]

Are reorderings in real data consistent with isomorphisms on linguistic parse trees?
Of course not.
Syntax-based Translation

[Figure: the same tree, treated as a tree substitution grammar and flattened to a weakly equivalent SCFG]

S → PRP1 VP2 / PRP1 VP2
PRP → I / yo
VP → VBD1 her VB2 / la VBD1 VB2
VBD → saw / vi
VB → duck / agacharse

Problem: we need a parser!
The Big Question
Where do the categories come from?
Answer #3: they are automatically induced!
This is an area of active research.
[Link]/workshops/ws10/groups/msgismt/
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
More has been written about machine translation evaluation than about machine translation itself.
– Yorick Wilks
Human evaluation
Have annotators read and assess translations:
• adequacy: "e covers the same content as f"
• fluency: "e looks like English"
Problems with human evaluation
• People hate evaluating translation, especially bad translation.
• People don't tend to agree with each other.
You try it:
• A: furious nAgA on wednesday , the tribal minimum pur of ten schools also was burnt
• B: furious nAgA on wednesday the tribal pur mini ten schools of them was also burnt
Automatic evaluation
Automatic evaluation of MT is hard:
• There are many correct ways to translate something.
• Measuring fluency is an open problem.
• Measuring adequacy is even harder (AI-complete?)
That said...
• Computers don't mind the work.
• Rapid evaluation supports iterative development.
• Evaluation can use fancier models than translation itself, since only a few hypotheses must be considered.
BLEU
• Most widespread automatic evaluation statistic by far
• Ranges from 0.0 (worst) to 1.0 (best)
• Computes n-gram overlap of a hypothesis with one or more references
• Weighted average of precisions
• "Brevity penalty" that kicks in if the hypothesis translation is too short
Computing BLEU
• n-gram overlap is computed on a per-segment basis
• a reference length r is determined for each segment: what is the closest length in the set of references?
• Statistics are aggregated over the corpus

  BLEU_4 = e^{min(1 − r/|h|, 0)} × Π_{n=1}^{4} prec(n)^{1/4}
hypothesis: ' extension of isi in uttar pradesh '
ref 1: ' isi 's expansion in uttar pradesh '
ref 2: ' the spread of isi in uttar pradesh '
ref 3: ' isi spreading in uttar pradesh '
ref 4: the spread of isi in uttar pradesh

prec(1) = (1+1+1+1+1+1+1) / (1+1+1+1+1+1+1+1) = 7/8 = 0.875
prec(2) = (1+1+1+1+1) / (1+1+1+1+1+1+1) = 5/7 = 0.714
prec(3) = 0.666
prec(4) = 0.6
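A minimal BLEU sketch (illustrative; the official script adds tokenization details and corpus-level aggregation of the clipped counts). Run on the example above, it reproduces prec(1) = 0.875 and prec(2) ≈ 0.714.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(hyp, refs, n):
    """Modified n-gram precision: clip counts by the max count in any reference."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / max(1, sum(hyp_counts.values()))

def bleu(hyp, refs, max_n=4):
    precs = [precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precs) == 0:
        return 0.0
    r = min((abs(len(ref) - len(hyp)), len(ref)) for ref in refs)[1]  # closest ref length
    brevity = math.exp(min(0.0, 1.0 - r / len(hyp)))
    return brevity * math.exp(sum(math.log(p) for p in precs) / max_n)

hyp = "' extension of isi in uttar pradesh '".split()
refs = [r.split() for r in ["' isi 's expansion in uttar pradesh '",
                            "' the spread of isi in uttar pradesh '",
                            "' isi spreading in uttar pradesh '",
                            "the spread of isi in uttar pradesh"]]
print(precision(hyp, refs, 1), precision(hyp, refs, 2), bleu(hyp, refs))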
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
What you need to do MT
• Parallel corpus
• Word alignment
• Language modeling
• Decoder
Parallel corpora
• The Linguistic Data Consortium will sell you:
  – United Nations data
  – Canadian Hansards
  – Hong Kong laws parallel text
  – Parallel newswire
  – [Link]
• Or, you can download EuroParl: [Link]
Word alignment
• Giza++ is an open-source implementation of the IBM models
• [Link]
Language modeling
SRILM (the SRI Language Modeling toolkit)
• Developed for speech recognition, but works for MT
• All kinds of fancy smoothing algorithms
• [Link]
Decoder
• cdec ([Link])
• Moses ([Link])
  – Contains code for the entire MT pipeline, including decoding.
  – Decent-looking documentation
• Doing MT for a course project might be ambitious, but you might be able to experiment with one of the components and leave the rest as black boxes.
Outline
The noisy channel
Estimation and alignment
Phrase-based translation
Decoding
Minimum Error Rate Training
Syntactic Machine Translation
Evaluation
Practicalities
Beyond Parallel Sentences
What can you do without much parallel data?
Munteanu and Marcu (2005), "Improving Machine Translation Performance by Exploiting Non-Parallel Corpora"
• suppose we only have a few aligned sentences
• but lots of possibly-aligned documents
• idea: use the parallel corpus to train a system to find more parallel documents and sentences.
Wikipedia as a parallel corpus
Smith, Quirk, and Toutanova, "Extracting Parallel Sentences from Comparable Corpora Using Document Level Alignment"
• Thanks to Wikipedia, aligned documents are easy to obtain for many language pairs.
• Similar idea to M&M: train a model (a CRF) to identify aligned sentences, and add these to the MT system.
• Note: you still need some sentence-aligned data to train your initial model.
Wikipedia as a parallel corpus
For "medium"-sized corpora, adding Wikipedia sentences helps.
If you're evaluating on a test set of Wikipedia, it's helpful to add Wikipedia to your training set.
What can you do without ANY parallel data?
Haghighi et al., "Learning Bilingual Lexicons from Monolingual Corpora"
• Latent space models capture word similarity by factoring a matrix of local context counts, f_i ≈ W z_i
  – f_i is the feature vector for word i (e.g., context counts)
  – z_i is the latent representation for word i
  – W is some projection matrix
• Key idea: apply this to two languages at once, and automatically discover a translation lexicon.
  – Aligned words should have identical z_i.
  – f_S(s_i) ≈ W_S z_i
  – f_T(t_i) ≈ W_T z_i
Learning the lexicon
• Start with a seed set of 100 words
• Viterbi (hard) EM. Iterate:
  – Compute W_S, W_T, z_i for all i
  – Find the best alignment (bipartite mapping)
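Here is a rough sketch in the same spirit (not the authors' exact model, which uses probabilistic CCA with hard EM): learn linear projections from a seed dictionary with CCA, project every word's context-count vector into the shared latent space, and take the best bipartite matching as the lexicon. src_vectors and tgt_vectors are assumed to be precomputed context-count matrices.

import numpy as np
from sklearn.cross_decomposition import CCA
from scipy.optimize import linear_sum_assignment

def induce_lexicon(src_vectors, tgt_vectors, src_words, tgt_words, seed_pairs, dim=50):
    """seed_pairs: list of (source_word, target_word) translations used for supervision.
    dim should not exceed the number of seed pairs or the feature dimensionality."""
    s_idx = {w: i for i, w in enumerate(src_words)}
    t_idx = {w: i for i, w in enumerate(tgt_words)}
    # Fit projections so that seed translations land near each other in the latent space.
    X = np.array([src_vectors[s_idx[s]] for s, _ in seed_pairs])
    Y = np.array([tgt_vectors[t_idx[t]] for _, t in seed_pairs])
    cca = CCA(n_components=dim).fit(X, Y)
    Z_s, Z_t = cca.transform(src_vectors, tgt_vectors)
    # Cosine similarity between every source and target word in the latent space.
    Z_s /= np.linalg.norm(Z_s, axis=1, keepdims=True)
    Z_t /= np.linalg.norm(Z_t, axis=1, keepdims=True)
    sim = Z_s @ Z_t.T
    # Best one-to-one alignment (bipartite matching) maximizing total similarity.
    rows, cols = linear_sum_assignment(-sim)
    return [(src_words[i], tgt_words[j]) for i, j in zip(rows, cols)]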
Results
• The approach works well for English and French:
• Not so well for Chinese and Arabic.
• Having a good seed helps
  (but inducing seeds from edit distance works ok).
• Having similar corpora (Wikipedia vs. Gigaword) helps.
Recap
• Key pieces for machine translation:
  – Alignment: word-to-word, phrase-to-phrase, or syntactic.
  – Estimation: MLE for the noisy channel, MERT for component weights.
  – Decoding: reordering makes beam search crucial.
  – Evaluation: BLEU counts n-gram overlap.
• Syntactic machine translation is mainly based on synchronous context-free grammars.
• Translation without parallel text:
  – Find parallel documents on the web.
  – Use Wikipedia; learn to classify parallel sentences.
  – Use latent space models to build bilingual lexicons.