0% au considerat acest document util (0 voturi)

152 vizualizări148 pagini

Expectation Maximization Algorithm

Documentul prezintă algoritmul EM (Expectation-Maximization) pentru estimarea parametrilor unui model probabilistic care include atât variabile observabile cât și neobservabile. Algoritmul maximizează o margine inferioară a verosimilității datelor observabile printr-o metodă iterativă de creștere pe coordonate asupra distribuției variabilelor neobservabile și a parametrilor modelului.

Încărcat de

Eduard Dănilă

Drepturi de autor

Respectăm cu strictețe drepturile privind conținutul. Dacă suspectați că acesta este conținutul dumneavoastră, reclamați-l aici.

Formate disponibile

Descărcați ca PDF, TXT sau citiți online pe Scribd

0% au considerat acest document util (0 voturi)

152 vizualizări148 pagini

Expectation Maximization Algorithm

Încărcat de

Eduard Dănilă

Drepturi de autor

Respectăm cu strictețe drepturile privind conținutul. Dacă suspectați că acesta este conținutul dumneavoastră, reclamați-l aici.

Formate disponibile

Descărcați ca PDF, TXT sau citiți online pe Scribd

0.

The Expectation-Maximization
(EM) Algorithm
1.

Algoritmul EM, fundamentare teoretică:

pasul E [şi pasul M]
CMU, 2008 fall, Eric Xing, HW4, pr. 1.1-3
2.

Algoritmul EM (Expectation-Maximization) permite crearea unor

modele probabiliste care pe de o parte depind de un set de parametri
θ iar pe de altă parte includ pe lângă variabilele obişnuite (“observabile”
sau “vizibile”) x şi variabile necunoscute (“neobservabile”, “ascunse” sau
“latente”) z.
În general, ı̂n astfel de situaţii/modele, nu se poate face ı̂n mod direct
o estimare a parametrilor modelului (θ), ı̂n aşa fel ı̂ncât să se garanteze
atingerea maximului verosimilităţii datelor observabile x.
În schimb, algoritmul EM procedează ı̂n manieră iterativă, constituind ast-
fel o modalitate foarte convenabilă de estimare a parametrilor θ.
Definim log-verosimilitatea datelor observabile (x) ca fiind log P (x | θ), iar
log-verosimilitatea datelor complete (observabile, x, şi neobservabile, z) ca
fiind log P (x, z | θ).
Observaţie: Pe tot parcursul acestui exerciţiu se va considera funcţia log
ca având baza supraunitară, fixată.
3.
a. Log-verosimilitatea datelor observabile (x) se poate exprima ı̂n funcţie
de datele neobservabile (z), astfel:a
!
not.
X
ℓ(θ) = log P (x | θ) = log P (x, z | θ)
z

În continuare vom nota cu q o funcţie / distribuţie de probabilitate definită

peste variabilele ascunse/neobservabile z.
Folosiţi inegalitatea lui Jensen pentru a demonstra că are loc următoarea
inegalitate:

X P (x, z | θ)
log P (x | θ) ≥ q(z) log (1)
z
q(z)

pentru orice x (fixat), pentru orice valoare a parametrului θ şi pentru orice
distribuţie probabilistă q definită peste variabilele neobservabile z.
a Reţineţi
că x, vectorul de date observabile, este fixat (dat), ı̂n vreme ce z, vectorul de date neobservabile, este
liber (variabil).
4.
Observaţie (1)
Semnificaţia inegalităţii (1) este următoarea:
Funcţia (de fapt, orice funcţie de forma)

def.
X P (x, z | θ)
F (q, θ) = q(z) log
z
q(z)

constituie o margine inferioară pentru funcţia de log-verosimilitate a

not.
datelor incomplete / observabile, ℓ(θ) = log P (x | θ).
Remarcaţi faptul că F este o funcţie de două variabile, iar prima variabilă
nu este de tip numeric (cum este θ), ci este de tip funcţional.
Mai mult, se observă că expresia funcţiei F este de fapt o medie,

P (x, z | θ)
Eq(z) log ,
q(z)
atunci când x, q şi parametrul θ se consideră fixaţi, iar z este lăsat să
varieze.
5.
Soluţie
Inegalitatea lui Jensen, ı̂n contextul teoriei probabilităţilor:
considerând X este o variabilă aleatoare (unară),
dacă ϕ este o funcţie (reală) convexă, atunci ϕ(E[X]) ≤ E[ϕ(X)];
dacă ϕ este funcţie concavă, atunci ϕ(E[X]) ≥ E[ϕ(X)].
Aici vom folosi funcţia log cu bază supraunitară (funcţie concavă), deci aplicând inegal-
itatea lui Jensen vom obţine: log(E[X]) ≥ E[log(X)].
Log-verosimilitatea datelor observabile este:
! !
not.
X X P (x, z | θ)
ℓ(θ) = log P (x | θ) = log P (x, z | θ) = log q(z)
z z
q(z)

def. P (x, z | θ)
= log Eq(z)
q(z)

P (x, z | θ)
Conform inegalităţii lui Jensen (ı̂nlocuind X de mai sus cu ), rezultă:
q(z)

not. P (x, z | θ) def. X P (x, z | θ)
ℓ(θ) = log P (x | θ) ≥ Eq(z) log = q(z) log ,
q(z) z
q(z)
6.

Notaţie

În continuare, pentru a vă aduce mereu aminte că distribuţia q se referă
la datele neobservabile z, vom folosi notaţia q(z) ı̂n loc de q.
În consecinţă, ı̂n cele ce urmează, ı̂n funcţie de context, q(z) va desemna
fie distribuţia q, fie valoarea acestei distribuţii pentru o valoare oarecare
[a variabilei neobservabile] z.
(Este adevărat că această lejeră ambiguitate poate induce ı̂n eroare citi-
torul neexperimentat.)
7.

b. Vă reamintim definiţia entropiei relative (numită şi divergenţa

Kullback-Leibler):

X P (z | x, θ)
KL(q(z) || P (z | x, θ)) = − q(z) log
z
q(z)

Arătaţi că
log P (x | θ) = F (q(z), θ) + KL(q(z) || P (z | x, θ)).

Observaţie (2):
Semnificaţia egalităţii care trebuie demonstrată la acest punct este foarte
not.
interesantă: diferenţa dintre funcţia obiectiv ℓ(θ) = log P (x | θ) şi marginea
sa inferioară F (q(z), θ) — a se vedea punctul a — este KL(q(z) || P (z | x, θ)).
Tocmai pe această chestiune se va “construi” punctul final, şi cel mai
important, al problemei noastre.
8.
Observaţie (3)
Ideile de bază ale algoritmului EM sunt două:
1. În loc să calculeze maximul funcţiei de log-verosimilitate log P (x | θ) ı̂n
raport cu θ, algoritmul EM va maximiza marginea sa inferioară, F (q(z), θ),
ı̂n raport cu ambele argumente, q(z) şi θ.
2. Pentru a căuta maximul (de fapt, un maxim local al) marginii in-
ferioare F (q(z), θ), algoritmul EM aplică metoda creşterii pe coordonate
(engl., coordinate ascent): după ce iniţial se fixează θ(0) eventual aleatoriu,
se maximizează iterativ funcţia F (q(z), θ), ı̂n mod alternativ: mai ı̂ntâi ı̂n
raport cu distribuţia q(z) şi apoi ı̂n raport cu parametrul θ.

Pasul E: q (t) (z) = argmax F (q(z), θ(t) )

q(z)

Pasul M: θ(t+1) = argmax F (q (t) (z), θ)

θ
9.
Soluţie

def.
X P (x, z | θ)
F (q(z), θ) = q(z) log
z
q(z)

X P (z | x, θ) · P (x | θ)
= q(z) log
z
q(z)

X P (z | x, θ)
= q(z) log + log P (x | θ)
z
q(z)
X
X P (z | x, θ)
= q(z) log + q(z) log P (x | θ)
z
q(z) z
X
= −KL(q(z) || P (z | x, θ)) + log P (x | θ) · q(z)
z
| {z }
=1

⇒ log P (x | θ) = F (q(z), θ) + KL(q(z) || P (z | x, θ)).

10.

Observaţie (4)
Conform proprietăţii KL(p || q) ≥ 0 pentru ∀p, q, rezultă KL(q(z) || P (z |
x, θ)) ≥ 0.
Aşadar, din egalitatea care tocmai a fost demonstrată la punctul b obţinem
(din nou!, după rezultatul de la punctul a) că F (q(z), θ) este o margine
not.
inferioară pentru log-verosimilitatea datelor observabile, ℓ(θ) = log P (x | θ).
11.

c. Fie θ(t) valoarea obţinută pentru parametrul / parametrii θ la iteraţia t

a algoritmului EM. Considerând această valoare fixată, arătaţi că maximul
lui F ı̂n raport cu argumentul / distribuţia q(z) este atins pentru distribuţia
P (z | x, θ(t) ), iar valoarea maximului este:
max F (q(z), θ(t) ) = EP (z|x,θ(t)) [log P (x, z | θ(t) )] + H(P (z | x, θ(t) ))
q(z)
12.
Soluţie
Trebuie să maximizăm F (q(z), θ(t) ) — marginea inferioară a log-
verosimilităţii datelor observabile x — ı̂n raport cu distribuţia q(z).
Pe de o parte, rezultatul de la punctul a ne spune că F (q(z), θ) ≤ log P (x | θ),
pentru orice valoare a lui θ; ı̂n particular, pentru θ(t) avem

log P (x | θ(t) ) ≥ F (q(z), θ(t) )

Pe de altă parte, dacă ı̂n egalitatea demonstrată la punctul b se ı̂nlocuieşte
θ cu θ(t) , rezultă:
log P (x | θ(t) ) = F (q(z), θ(t) ) + KL(q(z)||P (z | x, θ(t) ))

În fine, dacă alegem q(z) = P (z | x, θ(t) ), atunci termenul KL(q(z)||P (z |

x, θ(t) )) din dreapta egalităţii de mai sus devine zero (vezi exerciţiul [PS-31]
de la capitolul de Probabilităţi şi statistică).
Aşadar, valoarea maxq(z) F (q(z), θ(t) ) se obţine pentru distribuţia q(z) = P (z |
x, θ(t) ).
13.

Acum vom calcula această valoare maximă:

= Q(θ(t) | θ(t) ) + H[P (z | x, θ(t) )]

not.
unde Q(θ | θ(t) ) = EP (z|x,θ(t) ) [log P (x, z | θ)].
14.

Observaţie (5)
def.
Notând Gt (θ) = F (P (z | x, θ(t) ), θ), din calculul de mai sus rezultă că
Gt (θ(t) ) = log P (x | θ(t) ) = Q(θ(t) | θ(t) ) + H(P (z | x, θ(t) )). Se poate demon-
stra uşor — procedând similar cu calculul de mai sus — egalitatea

Gt (θ) = Q(θ | θ(t) ) + H[P (z | x, θ(t) )]

Observând că termenul H[P (z | x, θ(t) )] din această ultimă egalitate nu de-
pinde de θ, rezultă imediat că
argmax Gt (θ) = argmax Q(θ | θ(t) )
θ θ
În consecinţă,
def.
θ(t+1) = argmax F (P (z | x, θ(t) ), θ) = argmax Gt (θ) = argmax Q(θ | θ(t) )
θ θ θ
15.

From:
What is the expectation maximiza-
tion algorithm?
Chuong B. Do, Serafim Batzoglou,
Nature Biotechnology,
vol. 26, no. 8, 2008, pp. 897-899
16.

Egalitatea precedentă este responsabilă pentru următoarea reformulare

(cea uzuală!) a algoritmului EM:

Pasul E′ : calculează Q(θ | θ(t) ) = EP (z|x,θ(t) ) [log P (x, z | θ)]

Pasul M′ : calculează θ(t+1) = argmax Q(θ | θ(t) )
θ
17.

Algoritmul EM:
corectitudine / convergenţă

prelucrare de Liviu Ciortuz, după

[Link]/wiki/Expectation-maximization
18.

Pentru a maximiza funcţia de log-verosimilitate a datelor observabile, i.e.

def.
ℓ(θ) = log P (x | θ), unde baza logaritmului (nespecificată) este considerată
supraunitară, algoritmul EM procedează ı̂n mod iterativ, optimizând la
pasul M al fiecărei iteraţii (t) o funcţie “auxiliară”
def.
Q(θ | θ(t) ) = EP (z|x,θ(t) ) [log P (x, z | θ)],

reprezentând media log-verosimilităţii datelor complete (observabile şi

neobservabile) ı̂n raport cu distribuţia condiţională P (z | x, θ(t) ).
Vom considera iteraţiile t = 0, 1, . . . şi θ(t+1) = argmaxθ Q(θ | θ(t) ), cu θ(0) ales
ı̂n mod arbitrar.
Demonstraţi că pentru orice t fixat (arbitrar) şi pentru orice θ astfel ı̂ncât
Q(θ | θ(t) ) ≥ Q(θ(t) | θ(t) ) are loc inegalitatea:

log P (x | θ) − log P (x | θ(t) ) ≥ Q(θ | θ(t) ) − Q(θ(t) | θ(t) ) (2)

19.
Observaţii
1. Semnificaţia imediată a relaţiei (2):
Orice ı̂mbunătăţire a valorii funcţiei auxiliare Q(θ | θ(t) ) conduce la o
ı̂mbunătăţire cel puţin la fel de mare a valorii funcţiei obiectiv, ℓ(θ).
2. Dacă ı̂n inegalitatea (2) se ı̂nlocuieşte θ cu θ(t+1) = argmaxθ Q(θ | θ(t) ), va
rezulta

log P (x | θ(t+1) ) ≥ log P (x | θ(t) ). log P(x| θ)

În final, vom avea log P(x| θ (t+1) )
ℓ(θ(0) ) ≤ ℓ(θ(t) ) ≤ ℓ(θ(t+1) ) ≤ . . . .
log P(x| θ (t))
}
Şirul acesta (monoton) este
Q (θ |θ (t) )
(t+1)
mărginit superior de 0 (vezi
definiţia lui ℓ), deci converge la Q (θ (t) |θ(t) )
}
o anumită valoare ℓ∗ . În anumite Q (θ|θ (t) )
cazuri / condiţii, această valoare
este un maxim (ı̂n general, local)
al funcţiei de log-verosimilitate. θ (t) θ (t+1) θ
20.

Observaţii (cont.)
3. Conform aceleiaşi inegalităţi (2), la pasul M de la iteraţia t a algoritmului
EM, este suficient ca ı̂n loc să se ia θ(t+1) = argmaxθ Q(θ | θ(t) ), să se aleagă
θ(t+1) astfel ı̂ncât Q(θ(t+1) | θ(t) ) > Q(θ(t) | θ(t) ). Aceasta constituie versiunea
“generalizată” a algoritmului EM.
21.
Demonstraţia relaţiei (2)
P (x, z | θ) = P (z | x, θ) · P (x | θ) ⇒ log P (x | θ) = log P (x, z | θ) − log P (z | x, θ) ⇒
X
P (z | x, θ(t) ) · log P (x | θ) =
z
| {z }
1
X X
P (z | x, θ(t) ) · log P (x, z | θ) − P (z | x, θ(t) ) · log P (z | x, θ) ⇒
z z
X
log P (x | θ) = Q(θ | θ(t) ) − P (z | x, θ(t) ) · log P (z | x, θ)
z

Ultimul termen din egalitatea aceasta reprezintă o cross-entropie, pe care o vom nota
cu CH(θ | θ(t) ). Aşadar,

log P (x | θ) = Q(θ | θ(t) ) + CH(θ | θ(t) )

Această egalitate este valabilă pentru toate valorile posibile ale parametrului θ. În
particular pentru θ = θ(t) , vom avea:

log P (x | θ(t) ) = Q(θ(t) | θ(t) ) + CH(θ(t) | θ(t) )

22.

Demonstraţia relaţiei (2), cont.

Scăzând membru cu membru ultimele două egalităţi, obţinem:

Conform inegalităţii lui Gibbs, avem CH(θ | θ(t) ) ≥ CH(θ(t) | θ(t) ), deci ı̂n final rezultă:

log P (x | θ) − log P (x | θ(t) ) ≥ Q(θ | θ(t) ) − Q(θ(t) | θ(t) )

23.

log P(x| θ )

G t+1 (θ (t+1) )
} H [P(z|x; θ (t+1))]
Gt (θ (t) )
Q (θ | θ(t+1) )
} Q (θ|θ (t) )

H[ P(z|x; θ (t) ) ]

θ (t) θ (t+1) θ (t+2) θ

24.

Using the EM algorithm for

learning a categorical distribution
[and implicitly a multinomial distribution]
Application to the ABO blood model

Liviu Ciortuz, following

Anirban DasGupta, Probability for Statistics and Machine
Learning, Springer, 2011, Ex. 20.10
25.
Grupele sangvine ale oamenilor sunt determinate de variantele (,,alelele“)
unei gene situate pe cromozomul 9 (mai exact, la poziţia 9q34.2), numită
gena ABO. Se ştie că fiecare dintre noi dispunem de câte o pereche de
astfel de cromozomi, deci de câte două copii ale genei ABO, şi anume una
moştenită de la tată şi una moştenită de la mamă.
grupe alele
Se notează cu A, B şi O cele trei tipuri de alele sangvine moştenite
ale genei ABO. Alelele A şi B sunt dominante
A AA
ı̂n raport cu alela O. (Altfel spus, alela O este
AO
recesivă ı̂n raport cu fiecare dintre alelele A şi
B BB
B.) Alelele A şi B sunt codominante. Prin ur-
BO
mare, grupele sangvine pot fi specificate conform
AB AB
tabelului alăturat.
O OO
Presupunem că, ı̂ntr-o populaţie oarecare, fiecare dintre alelele A, B şi
O are la bărbaţi aceeaşi frecvenţă ca şi la femei. În cele ce urmează,
aceste frecvenţe (notate respectiv cu pA , pB şi pO ) vor fi considerate a
priori necunoscute, dar urmează să le determinăm.
26.

Location of the ABO gene on

chromosome 9 The ABO protein

Source: wiki
Source: wiki
27.

ch. 9 ch. 9’

pheno− blood
type type Probability counts
nAA AA pA 2 +
probabilities alleles alleles A nA
AO 2p p
p A O
A A A
B
ABO ABO nBB BB pB 2 +
A B B B nB
gene gene BO 2p p
p O O B O
O
AB AB 2p p
A B
nAB
homologous chromosomes
O O pO2 nO

parameters unobservable data observable data

Output EM algorithm Input

28.

a. Considerăm că ı̂ntr-un eşantion de populaţie format din n per-

soane sunt nA persoane cu grupa sangvină A, nB persoane cu grupa
sangvină B, nAB persoane cu grupa sangvină AB şi nO persoane cu
grupa sangvină O. (Evident, n = na + nB + nAB + nO .)

Pornind de la aceste date ,,observabile“, să se deriveze algoritmul

EM pentru determinarea probabilităţilor pA , pB şi pO . (Evident,
ı̂ntrucât suma lor este 1, va fi suficient să se determine doar două
dintre ele.)
29.

Indicaţii

1. Presupunând că procesul de asociere a alelelor moştenite de

către un individ de la părinţii lui respectă proprietăţile specifice
evenimentelor aleatoare independente, rezultă că probabilităţile
de realizare a combinaţiilor (ı̂n termeni genetici: ,,fenotipurile“)
AA, AO, BB, BO, AB şi OO ı̂ntr-o populaţie oarecare sunt p2A ,
2pA pO , p2B , 2pB pO , 2pA pB , şi respectiv p2O .

2. Considerând nA = nAA +nAO şi nB = nBB +nBO , unde semnificaţiile

numerelor nAA , nAO , nBB şi nBO sunt similare cu semnificaţiile nu-
merelor nA , nB , nAB , şi nO care au fost precizate mai sus, este
natural ca ı̂n formularea algoritmului EM datele nAA şi nBB să fie
considerate ,,neobservabile“. Ca parametri ai modelului, se vor
considera probabilităţile pA , pB şi pO .
30.

3. La pasul E al algoritmului EM veţi calcula n̂AA şi n̂BB ,

care reprezintă respectiv numărul ,,aşteptat“ de apariţii ale
combinaţiei de alele AA şi numărul ,,aşteptat“ de apariţii ale
combinaţiei de alele BB ı̂n populaţia dată.

Veţi scrie apoi funcţia de log-verosimilitate a datelor com-

plete (,,observabile“ şi ,,neobservabile“), exprimată cu ajutorul
distribuţiei Multinomial(n; p2A , 2pA pO , p2B , 2pB pO , 2pA pB , p2O ).

4. La pasul M al algoritmului EM, pornind de la media funcţiei

de log-verosimilitate care a fost calculată la pasul E, veţi stabili
regulile de actualizare pentru probabilităţile pA , pB şi pO .
Atenţie: suma acestor probabilităţi fiind 1, problema de opti-
mizare pe care va trebui să o rezolvaţi la pasul M este una cu
restricţii. În acest sens, metoda multiplicatorilor lui Lagrange vă
poate fi de folos.
31.

Solution

Notations:
• parameters: p = {pA , pB , pO }
(in fact, it will be enough to estimate pA and pB , since pA + pB + pO = 1)
• observable data: nobs = {nA , nB , nAB , nO }
• unobservable data: nunobs = {nAA , nAO , nBB , nBO }
(in fact, we will restrict nunobs to {nAA , nBB })
• complete data: ncompl = nobs ∪ nunobs
• n = nA + nB + nAB + nO , nA = nAA + nAO , nB = nBB + nBO .

The likelihood of complete data:

not.
Lp) = P (ncompl |p) =
n!
· (p2A )nAA · (2 pA pO )nAO · (p2B )nBB · (2 pB pO )nBO · (2 pA pB )nAB · (p2O )nO
nAA ! nAO ! nBB ! nBO ! nAB ! nO !
32.
The likelihood function:
def.
ℓ(p) = ln L(p)
= ln c + nAA ln(p2A ) + nAO ln(2 pA pO ) + nBB ln(p2B ) + nBO ln(2 pB pO ) + nAB ln(2 pA pB ) + nO ln(p2O )
= ln c′ + 2 nAA ln pA + nAO (ln pA + ln pO ) +
2 nBB ln pB + nBO (ln pB + ln pO ) + nAB (ln pA + ln pB ) + 2 nO ln pO
= ln c′ + 2 nAA ln pA + (nA − nAA )(ln pA + ln pO ) +
2 nBB ln pB + (nB − nBB )(ln pB + ln pO ) + nAB (ln pA + ln pB ) + 2 nO ln pO ,

where c and c′ are constants which do not depend on the p parameter.

The “auxiliary” function:
def.
Q(p|p(t) ) = E[ℓ(p)|nobs ; p(t) ]
= ln c + 2 n̂AA ln pA + (nA − n̂AA )(ln pA + ln pO ) +
2 n̂BB ln pB + (nB − n̂BB )(ln pB + ln pO ) + nAB (ln pA + ln pB ) + 2 nO ln pO ,

where
not. (t) (t) (t)
n̂AA = E[nAA |nobs ; p(t) ] = E[nAA |nA , nB , nAB , nO ; pA , pB , pO ]
not. (t) (t) (t)
n̂BB = E[nBB |nobs ; p(t) ] = E[nBB |nA , nB , nAB , nO ; pA , pB , pO ]
33.
E step: As indicated by the expression of Q, here we have to compute n̂AA
and n̂BB the expected number of individuals having the phenotype (aka,
the allele pair) AA, and respectively BB.
Firstly, taking into account that nAA (or, equivalently, the phenotype AA),
when seen as a random variable, follows a binomial distribution of param-
(t)
(pA )2
eters nA and (t) (t) (t)
, its expectation will be:
2
(pA ) + 2 pA pO
not. (t) (t) (t)
n̂AA = E[nAA |nA , nB , nAB , nO ; pA , pB , pO ]
(t)
(pA )2
= (t) (t) (t)
· nA
(pA )2 +2 pA pO
Similarly,
not. (t) (t) (t)
n̂BB = E[nBB |nA , nB , nAB , nO ; pA , pB , pO ]
(t)
(pB )2
= (t) (t) (t)
· nB
(pB )2 +2 pB pO
34.
M step:
Given the constraint pA + pB + pO = 1, we will introduce the Lagrange vari-
able/“multiplier” λ and solve for the optimisation problem
not. (t+1) (t+1) (t+1)
p(t+1) = (pA , pB , pO )
(t) (t) (t)
= argmax [Q(pA , pB , pO |pA , pB , pO ) + λ(1 − (pA + pB + pO ))]
pA , pB , pO

= argmax [ln c′ + 2 n̂AA ln pA + (nA − n̂AA )(ln pA + ln pO ) + 2 n̂BB ln pB +

pA , pB , pO
argmax (nB − n̂BB )(ln pB + ln pO ) + nAB (ln pA + ln pB ) + 2 nO ln pO +
pA , pB , pO
argmax λ(1 − (pA + pB + pO ))]
pA , pB , pO
Taking the partial derivatives of the objective function w.r.t. pA , pB and pO
and solving them leads to:
1 1
(2 n̂AA + nA − n̂AA + nAB ) − λ = 0 ⇒ p̂A = (n̂AA + nA + nAB )
pA λ
1 1
(2 n̂BB + nB − n̂BB + nAB ) − λ = 0 ⇒ p̂B = (n̂BB + nB + nAB )
pB λ
1 1
(nA − n̂AA + nB − n̂BB + 2 nO ) − λ = 0 ⇒ p̂O = (nA − n̂AA + nB − n̂BB + 2nO )
pO λ
35.
Now, enforcing the constraint p̂A + p̂B + p̂O = 1 gives
1
(n̂AA + nA + nAB + n̂BB + nB + nAB + nA − n̂AA + nB − n̂BB + 2nO ) = 1 ⇔
λ
2
(nA + nB + nAB + nO ) = 1
λ
1
Since n = nA + nB + nAB + nO , it folows immediately that 2n = 1.
λ
Therefore λ = 2n.
Together with the results obtained on the previous slide, this leads
to the following updating relations:
(t+1) 1
p̂A = (n̂AA + nA + nAB )
2n
(t+1) 1
p̂B = (n̂BB + nB + nAB )
2n
(t+1) 1 1
p̂O = (nA − n̂AA + nB − n̂BB + 2nO ) = (nAO + nBO + 2nO )
2n 2n
36.

b. Implementaţi algoritmul EM pe care l-aţi conceput la punctul

a şi rulaţi-l pentru input-ul nA = 186, nB = 38, nAB = 13, nO = 284
(n = 521).

Ca valori iniţiale pentru parametri, veţi lucra mai ı̂ntâi cu pA =

1
pB = pO = , iar apoi cu pA = pO = 0.1 şi pB = 0.98.
3
(t) (t) (t)
Pentru oprire, veţi cere ca pA , pB şi pO să nu difere (fiecare ı̂n
parte) cu mai mult de 10−4 faţă de valorile calculate la iteraţia
precedentă. Comparaţi rezultatele obţinute pentru fiecare din cele
două iniţializări.
37.

Starting values: Starting values:

pA = pB = pO = 1/3. pA = pO = 0.1, pB = 0.98.
Iterations: Iterations:
t pA pB pC t pA pB pC
1 0.2505 0.0611 0.6884 1 0.2505 0.0847 0.6648
2 0.2185 0.0505 0.7311 2 0.2193 0.0511 0.7296
3 0.2142 0.0502 0.7357 3 0.2143 0.0502 0.7355
4 0.2137 0.0501 0.7362 4 0.2137 0.0501 0.7362
5 0.2136 0.0501 0.7363 5 0.2136 0.0501 0.7363
6 0.2136 0.0501 0.7363 6 0.2136 0.0501 0.7363
Note: The final results are the same.
38.

Using the EM algorithm for

learning a multinomial distribution
which depends on a single parameter
Application to a bioinformatics task:
computing probabilities for two linked bi-allelic loci
in haploid cells

Liviu Ciortuz, following

Brani Vidakovic, Georgia Institute of Technology,
Bayesian Statistics course (ISyE 8843A), 2004,
Handout 12, sec. 1.2.1
39.
Biological background: Meiosis

Source: wiki
40.
Biological background (cont’d)
P P
AB (1−p)/2 AB (1−p’)/2
Ab p/2 Ab p’/2
aB p/2 aB p’/2
ab (1−p)/2 ab (1−p’)/2
male/father diploid cell female/mother diploid cell

offspring diploid cell

ch. i ch. i ch. i ch. i

ch. i ch. i
A/a A/a A/a A/a
B/b B/b B/b B/b
A/a A/a
B/b B/b
meiosis meiosis
(including (including
crossover) crossover)

homologous chromosomes homologous chromosomes

haploid cell haploid cell

fecundation
41.
Biological background (cont’d)
Note: Alleles A and B are dominant w.r.t. a and respectively b.
Genotype:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
AA Aa aA AA Aa aA AA Aa aA AA Aa aA aa aa aa aa
BB BB BB Bb Bb Bb bB bB bB bb bb bb BB Bb bB bb
| {z }| {z }| {z } | {z }
AB Ab aB ab
nAB nAb naB nab
2+ψ 1−ψ 1−ψ ψ
4 4 4 4
not.
with ψ = (1 − p)(1 − p′ ).
We have designated in magenta colour the offspring phenotype, in blue the counts, i.e.
number of individuals of having a certain phenotype in a given population, and in read
the corresponding probabilities.
Note that
(1 − p)(1 − p′ )
P (ab) = and
4
p p′ p 1 − p′ 1 − p′ p 1 − (1 − p)(1 − p′ )
P (Ab) = P (aB) = · + · + · = .
2 2 2 2 2 2 4
42.

Fie o variabilă aleatoare X care iavalori ı̂n mulţimea {v1 , v2 , v3 , v4 }

2+ψ 1−ψ 1−ψ ψ
şi urmează distribuţia Multinomial n; , , , , unde
4 4 4 4
ψ este un parametru cu valori ı̂n intervalul (0, 1). Cele patru prob-
abilităţi listate ı̂n definiţia acestei distribuţii multinomiale sunt ı̂n
corespondenţă directă cu valorile v1 , v2 , v3 şi v4 .

a. Presupunem că n1 , n2 , n3 şi n4 sunt numărul de ,,realizări“

ale valorilor v1 , v2 , v3 şi v4 ı̂n totalul celor n ,,observaţii“. Pentru
fixarea ideilor, vom considera n1 = 125, n2 = 18, n3 = 20 şi n4 = 34.
Calculaţi estimarea de verosimilitate maximă (MLE) a parametru-
lui ψ.
43.

Solution
By denoting n = n1 + n2 + n3 + n4,
2+ψ 1−ψ 1−ψ ψ
and kowing that n ∼ Multinomial n; , , , ,
4 4 4 4
it follows that the verosimility function will be
n n n n4
def. n! 2+ψ 1 1−ψ 2 1−ψ 3 ψ
L(ψ) = P (D|ψ) = · · · · ,
n1 ! n2 ! n3 ! n4 ! 4 4 4 4

while
def.
ℓ(ψ) = ln L(ψ) = ln c + n1 ln(2 + ψ) + (n2 + n3 ) ln(1 − ψ) + n4 ln(ψ),

where c is constant w.r.t. ψ.

44.

∂ n1 n2 + n3 n4 n1 · ψ(1 − ψ) − (n2 + n3 ) · ψ(2 + ψ) + n4 · (2 + ψ)(1 − ψ)

ℓ(ψ) = − + =
∂ψ 2+ψ 1−ψ ψ (2 + ψ)(1 − ψ)ψ
n1 ψ − n1 ψ 2 − 2n2 ψ − n2 ψ 2 − 2n3 ψ − n3 ψ 2 + 2n4 − 2n4 ψ + n4 ψ − n4 ψ 2
=
(2 + ψ)(1 − ψ)ψ
−ψ 2 (n1 + n2 + n3 + n4 ) + ψ(n1 − 2n2 − 2n3 − n4 ) + 2n4
=
(2 + ψ)(1 − ψ)ψ
−nψ 2 + ψ(n1 − 2n2 − 2n3 − n4 ) + 2n4
=
(2 + ψ)(1 − ψ)ψ

Note that the denominator (2 + ψ)(1 − ψ)ψ is always positive.

By substituting n1 , n2 , n3 n4 for 125, 18, 20 and respectively 34, the nominator
becomes −197ψ 2 + 15ψ + 68.
By analysing the sign of this expression, it is easy to see that for our data
the function ℓ(ψ) reaches its maximum in the interval (0, 1) for
√ √
−15 + 152 + 4 · 197 · 68 −15 + 225 + 53809 −15 + 231.967
ψ̂ = = = = 0.626769036.
−2 · 197 −394 −394
45.

1 ψ 1 − ψ 1 − ψ ψ
b. Fie acum variabila aleatoare X ′ ∼ Multinomial n; , , , , . Valorile luate
2 4 4 4 4
de variabila X ′ , corespunzător acestor cinci probabilităţi, sunt (ı̂n ordine) v1 , v1 , v2 , v3 şi
v4 .
Observaţi că, ı̂n raport cu distribuţia multinomială de la punctul a, am ,,descompus“
2+ψ 1 ψ
probabilitatea ı̂n două probabilităţi, şi , ca şi cum ele ar corespunde unor
4 2 4
evenimente disjuncte.
Vom considera n11 , n12 , n2 , n3 şi respectiv n4 numărul de ,,realizări“ ale acestor valori,
ı̂nsă de data aceasta n11 şi n12 vor fi neobservabile. În schimb, vom furniza suma
n1 = n11 + n12 ca dată ,,observabilă“, alături de celelalte date observabile, n2 , n3 şi n4 .
Să se estimeze parametrul ψ folosind algoritmul EM. Cum este această estimare faţă
de valoarea obţinută la punctul a?

Indicaţie: Veţi elabora formulele corespunzătoare pasului E şi pasului M. Apoi veţi face
o implementare şi veţi rula algoritmul EM (pornind, de exemplu, cu valoarea iniţială
0.5 pentru ψ) până când valorile acestui parametru până la cea de-a şasea zecimală nu
se mai modifică.
46.
Solution: E-step
The verosimility function is now
n11 n12 n n n4
n! 1 ψ 1−ψ 2 1−ψ 3 ψ
P (D|ψ) = · · · · ·
n11 ! n12 ! n2 ! n3 ! n4 ! 2 4 4 4 4
n11 n2 +n3 n12 +n4
n! 1 1−ψ ψ
= · · · ,
n11 ! n12 ! n2 ! n3 ! n4 ! 2 4 4

while the log-verosimility function can be written as

ln P (D|ψ) = ln d + (n2 + n3 ) ln(1 − ψ) + (n12 + n4 ) ln(ψ),

where d is constant w.r.t. ψ.

Therefore, the ,,auxiliary“ function will be

Q(ψ|ψ (t+1) ) = En12 |n1 ,n2 ,n3 ,n4 ,ψ(t) [ln P (D|ψ)]
= ln d + (n2 + n3 ) ln(1 − ψ) + (n̂12 + n4 ) ln(ψ),

where
1
not. (t) 2 ψ (t)
n̂12 = E[n12 |n1 , n2 , n3 , n4 , ψ ] = = · n1 .
1
2 + ψ4 2 + ψ (t)
47.
M-step
∂ 1
E[ln P (D|ψ)] = −(n2 + n3 ) + (n̂12 + n4 )
∂ψ 1−ψ

∂2 −1 1
⇒ E[ln P (D|ψ)] = (n 1 + n 3 ) − (n̂ 12 + n 4 ) ≤0
∂ψ 2 n1 + n3 ψ2

⇒ the auxiliary function is concave.

∂
E[ln P (D|ψ)] = 0 ⇔
∂ψ
(n̂12 + n4 )(1 − ψ) = ψ(n2 + n3 ) ⇔
ψ(n̂12 + n2 + n3 + n4 ) = n̂12 + n4 ⇔
n̂12 + n4
ψ (t+1) =
n − n̂11
where
not. 2
n̂11 = E[n11 |n1 , n2 , n3 , n4 , ψ (t) ] = n1 − n̂12 = · n1 .
2 + ψ (t)
A Matlab implementation of this EM algorithm produces ψ = 0.62682139.
48.

The EM algorithm for solving

a Bernoulli mixture model
CMU, 2008 fall, Eric Xing, HW4, pr. 1.4-7
49.

Suppose I have two unfair coins. The first lands on heads with probability
p, and the second lands on heads with probability q.
Imagine n tosses, where for each toss I choose to use the first coin with
probability π and choose to use the second with probability 1 − π. The
outcome of each toss i is xi ∈ {0, 1}.
not.
Suppose I tell you the outcomes of the n tosses, x = {x1 , x2 , . . . , xn }, but I
don’t tell you which coins I used on which toss.
Given only the outcomes, x, your job is to compute estimates for θ which
is the set of all parameters, θ = {p, q, π} using the EM algorithm.
To compute these estimates, we will create a latent variable Z, where
zi ∈ {0, 1} indicates the coin used for the nth toss. For example z2 = 1
indicates the first coin was used on the second toss.
We define the “incomplete” data log-likelihood as log P (x|θ) and the “com-
plete” data log-likelihood as log P (x, z|θ).
50.

a. Show that E[zi |xi , θ] = P (zi = 1|xi , θ).

b. Use Bayes rule to compute P (zi = 1|xi , θ(t) ), where θ(t) denotes the pa-
rameters at iteration t.
c. Write down the complete log-likelihood, log P (x, z|θ).
d. E-Step: Show that the expected log-likelihood of the complete data
not.
Q(θ | θ(t) ) = EP (z|x,θ(t)) [log P (x, z | θ)] is given by
n
X
Q(θ | θ(t) ) = E[zi | xi , θ(t) ] · (log π + xi log p + (1 − xi ) log(1 − p)) +
i=1

+ 1 − E[zi | xi , θ(t) ] · (log(1 − π) + xi log q + (1 − xi ) log(1 − q))

e. M-Step: Describe the process you would use to obtain the update
equations for p(t+1) , q (t+1) şi π (t+1) .
51.

Solution
a.
X
E[zi | xi , θ] = zi P (zi | xi , θ) = 0 · P (zi = 0 | xi , θ) + 1 · P (zi = 1 | xi , θ)
z∈{0,1}

⇒ E[zi | xi , θ] = P (zi = 1 | xi , θ)

b.
P (zi = 1 | xi , θ)
P (xi | zi = 1, θ)P (zi = 1 | θ)
=
P (xi | zi = 1, θ)P (zi = 1 | θ) + P (xi | zi = 0, θ)P (zi = 0 | θ)
pxi · (1 − p)1−xi · π
= xi
p · (1 − p)1−xi · π + q xi · (1 − q)1−xi · (1 − π)
52.
c.
Note (in Romanian): Din punct de vedere metodologic, pentru a calcula P (x, z|θ)
ne putem inspira de la punctul precedent: observăm că la numitorul fracţiei care ne
dă valoarea lui P (zi |xi , θ) avem P (xi , zi = 1|θ) şi P (xi , zi = 0|θ). Pornind de la aceste două
expresii, ne vom propune să le exprimăm ı̂n mod unitar, adică sub forma unei singure
expresii.
n
Y n
Y
i.i.d.
log P (x, z | θ) = log P (xi , zi | θ) = log P (xi | zi , θ) · P (zi | θ)
i=1 i=1
Yn
xi 1−xi
zi xi 1−xi
1−zi
= log p (1 − p) π q (1 − q) (1 − π)
i=1
n
X
xi 1−xi
zi xi 1−xi
1−zi
= log p (1 − p) π q (1 − q) (1 − π)
i=1
Xn
xi 1−xi
xi 1−xi

= zi log p (1 − p) π + (1 − zi ) log q (1 − q) (1 − π)
i=1
53.

d.
not.
Q(θ | θ(t) ) = EP (z|x,θ(t) ) [log P (x, z | θ)]
n
X xi 1−xi

= EP (z|x,θ(t) ) [ zi log p (1 − p) π +
i=1
n
X
xi 1−xi
EP (z|x,θ(t) ) [ [ (1 − zi ) log q (1 − q) (1 − π) ]
n i=1
X
= E[zi | xi , θ(t) ] · log pxi (1 − p)1−xi π +
i=1
(t)
xi 1−xi

+ 1 − E[zi | xi , θ ] · q (1 − q) (1 − π)
n
X
= E[zi | xi , θ(t) ] · (log π + xi log p + (1 − xi ) log(1 − p)) +
i=1
(t)

+ 1 − E[zi | xi , θ ] · (log(1 − π) + xi log q + (1 − xi ) log(1 − q))
54.
e.
(t) not. (t) (p(t) )xi · (1 − p(t) )1−xi · π (t)
µi = E[zi | xi , θ ] = (t) x
(p ) i · (1 − p(t) )1−xi · π (t) + (q (t) )xi · (1 − q (t) )1−xi · (1 − π (t) )
n h
X (t)
⇒ Q(θ | θ(t) ) = µi (log π + xi log p + (1 − xi ) log(1 − p)) +
i=1
n
X i
(t)
[ + 1 − µi · (log(1 − π) + xi log q + (1 − xi ) log(1 − q))
i=1

n n n
∂Q(θ | θ(t) ) X (t) xi 1 − xi 1 X (t) 1 X (t)
=0⇔ µi − =0⇔ µ xi = µ (1 − xi )
∂p i=1
p 1−p p i=1 i 1 − p i=1 i
n n n n n
!
X (t)
X (t)
X (t)
X (t)
X (t)
⇔ (1−p) µi xi = p µi (1−xi ) ⇔ µi xi = p µi (1−xi ) + µi xi
i=1 i=1 i=1 i=1 i=1
n n Pn (t)
i=1 µi xi
X (t)
X (t)
⇔ µi xi =p µi ⇒ p(t+1) = Pn (t)
∈ [0, 1]
i=1 i=1 i=1 µi
55.

n Pn (t)
∂Q(θ | θ(t) ) X (t) xi 1 − xi (t+1) i=1 (1 − µi )xi
=0⇔ (1−µi ) − =0⇒q = Pn (t)
∈ [0, 1]
∂q i=1
q 1−q (1 − µ i )
i=1

n
! n n
(t) (t) (t)
∂Q(θ | θ ) X µi 1−
1 X (t) µi 1 X (t)
=0⇔ =0⇔ − µi = (1 − µi )
∂π i=1
π 1−π
π i=1 1 − π i=1
n n n n n
!
X (t)
X (t)
X (t)
X (t)
X (t)
⇔ (1 − π) µi = π (1 − µi ) ⇔ µi = π (1 − µi ) + µi
i=1 i=1 i=1 i=1 i=1
n n n n
X (t)
X X (t) 1 X (t)
⇔ µi =π 1⇔ µi = nπ ⇒ π (t+1) = µ ∈ [0, 1]
i=1 i=1 i=1
n i=1 i

∂2 ∂2 ∂2
Note: It can be easily shown that the second derivatives ( 2 , 2 and 2 ) are all
∂p ∂q ∂π
negative on the respective domains of definition. Therefore, the above solutions (p(t+1) ,
q (t+1) and π (t+1) ) maximize the Q functional.
56.

Estimarea parametrilor
unei mixturi de distribuţii [de tip] Bernoulli
A. când toate variabilele sunt observabile: MLE;
B. când unele variabile sunt neobservabile: algoritmul EM

prelucrare de Liviu Ciortuz, după

“What is the expectation maximization algorithm?”,
Chuong B. Do, Serafim Batzoglou,
Nature Biotechnology, vol. 26, no. 8, 2008, pag. 897-899
57.

Fie următorul experiment probabilist:

Dispunem de două monede, A şi B.
Efectuăm 5 serii de operaţiuni de tipul următor:
− Alegem ı̂n mod aleatoriu una dintre monedele A şi B, cu probabilitate
egală (1/2);
− Aruncăm de 10 ori moneda care tocmai a fost aleasă (Z) şi notăm
rezultatul, sumarizat ca număr (X) de feţe ‘head’ (ro. ‘stemă’) obţinute
ı̂n urma aruncării.
58.

A. La acest punct vom considera că s-a obţinut următorul rezultat pentru
experimentul nostru:
i Z i Xi
1 B 5H (5T )
2 A 9H (1T )
3 A 8H (2T )
4 B 4H (6T )
5 A 7H (3T )
Semnificaţia variabilelor aleatoare Zi şi Xi pentru i = 1, . . . , 5 din tabelul de
mai sus este imediată.
59.

i. Calculaţi θ̂A şi θ̂B , probabilităţile de apariţie a feţei ‘head’/stemă pentru

cele două monede, folosind definiţia clasică pentru probabilitatea eveni-
mentelor aleatoare, şi anume raportul dintre numărul de cazuri favorabile
şi numărul de cazuri posibile, relativ la ı̂ntregul experiment.

Răspuns:
Analizând datele din tabelul din enunţ, rezultă imediat
24 9
θ̂A = = 0.8 şi θ̂B = = 0.45.
24 + 6 9 + 11
De observat că termenii 6 şi respectiv 11 de la numitorii acestor fracţii
reprezintă numărul de feţe ‘tail’/ban care au fost obţinute la aruncarea
monedei A şi respectiv B: 6T = 1T + 2T + 3T , 11T = 5T + 6T .
60.
61.
Observaţie
Dacă ı̂n locul variabilelor binare Zi ∈ {A, B} pentru i = 1, . . . , 5 introducem
ı̂n mod natural variabilele-indicator Zi,A ∈ {0, 1} şi Zi,B ∈ {0, 1} tot pentru
i = 1, . . . , 5, definite prin Zi,A = 1 iff Zi = A, şi Zi,A = 0 iff Zi,B = 0, atunci
procesările necesare pentru calculul probabilităţilor/parametrilor θ̂A şi θ̂B
pot fi prezentate ı̂n mod sintetizat ca ı̂n tabelul de mai jos.a
i Zi,A Zi,B Xi
1 0 1 5H
2 1 0 9H ⇒
3 1 0 8H
4 0 1 4H
5 1 0 7H

aPrezentăm acest “artificiu” ca pregătire pentru rezolvarea (ulterioară a) punctului

B al prezentei probleme.
62.

Xi · Zi,A Xi · Zi,B
0H (0T ) 5H (5T )
9H (1T ) 0H (0T )
8H (2T ) 0H (0T )
⇒ 0H (0T ) 4H (6T ) ⇒
7H (3T ) 0H (0T )
P5 P5
i=1 Xi · Zi,A = 24H Xi · Zi,B = 9H
P5 Pi=1
5
i=1 (10 − Xi ) · Zi,A = 6T i=1 (10 − Xi ) · Zi,B = 11T


 24
 θ̂A =
 = 0.8
24 + 6
⇒
 9
 θ̂B =
 = 0.45
9 + 11
63.

not.
ii. Calculaţi L1 (θA , θB ) = P (X, Z | θA , θB ), funcţia de verosimilitate a datelor
not. not.
X = < X1 , . . . , X5 > şi Z = < Z1 , . . . , Z5 >, ı̂n raport cu parametrii θA şi re-
spectiv θB ai distribuţiilor Bernoulli care modelează aruncarea celor două
monede.

Observaţie: Facem presupunerea că am reţinut succesiunea [tuturor] rezul-

tatelor obţinute la aruncările celor două monede. Precizarea de facto a
acestei succesiuni nu este esenţială. Dacă nu s-ar face această presupunere,
ar trebui să să lucrăm cu distribuţia binomială.
64.
Răspuns:
Calculul verosimilităţii datelor:
5
Y
def. indep. cdt.
L1 (θA , θB ) = P (X, ZA , ZB | θA , θB ) = P (Xi , Zi,A , Zi,B | θA , θB )
i=1
5
Y
= P (Xi | Zi,A , Zi,B ; θA , θB ) · P (Zi,A , Zi,B | θA , θB )
i=1
= P (X1 | ZB,1 = 1, θB ) · 1/2 ·
P (X2 | ZA,2 = 1, θA ) · 1/2 ·
P (X3 | ZA,3 = 1, θA ) · 1/2 ·
P (X4 | ZB,4 = 1, θB ) · 1/2 ·
P (X5 | ZA,5 = 1, θA ) · 1/2
5 1
= θB (1 − θB )5 · θA9 (1 − θA ) · θA8 (1 − θA )2 · θB
4
(1 − θB )6 · θA7 (1 − θA )3 ·
25
1 24 6 9 11
= θA (1 − θA ) θB (1 − θB )
25
65.

not. not.
iii. Calculaţi θ̂A = arg maxθA log L1 (θA , θB ) şi θ̂B = arg maxθB log L1 (θA , θB )
folosind derivatele parţiale de ordinul ı̂ntâi.

Observaţii:
1. Baza logaritmului, fixată dar lăsată mai sus nespecificată, se va considera
supraunitară (de exemplu 2, e sau 10).
2. Lucrând corect, veţi obţine acelaşi rezultat ca la punctul i.
66.
Răspuns:
Funcţia de log-verosimilitate a datelor complete se exprimă astfel:
log L1 (θA , θB ) = −5 log 2 + 24 log θA + 6 log(1 − θA ) + 9 log θB + 11 log(1 − θB )
Prin urmare, maximul acestei funcţii ı̂n raport cu parametrul θA se cal-
culează astfel:
∂ log L1 (θA , θB ) 24 6 4 1
=0 ⇔ − =0⇔ = ⇔ 4 − 4 θA = θA ⇔ θ̂A = 0.8
∂θA θA 1 − θA θA 1 − θA

∂ log L1 (θA , θB )
Similar, se face calculul şi pentru şi se obţine θ̂B = 0.45.a
∂θB
Cele două valori obţinute, θ̂A şi θ̂B , reprezintă estimarea de verosimilitate
maximă (MLE) a probabilităţilor de apariţie a feţei ‘head’ (‘stema’) pentru
moneda A şi respectiv moneda B.
aSe verifică uşor faptul că ı̂ntr-adevăr rădăcinile derivatelor parţiale de ordinul ı̂ntâi
pentru funcţia de log-verosimilitate reprezintă puncte de maxim. Pentru aceasta se
studiază semnele acestor derivate.
67.

Observaţii

1. Am arătat pe acest caz particular că metoda de calculare a proba-

bilităţilor (θ̂A şi θ̂B ) direct din datele observate (aşa cum o ştim din liceu)
corespunde de fapt metodei de estimare ı̂n sensul verosimităţii maxime
(MLE).
2. La punctul B vom arăta cum anume se poate face estimarea aceloraşi
parametri θA şi θB ı̂n cazul in care o parte din date, şi anume variabilele Zi
(pentru i = 1, . . . , 5) sunt neobservabile.
68.

B. La acest punct se va relua experimentul de la punctul A, ı̂nsă de data

aceasta vom considera că valorile variabilelor Zi nu sunt cunoscute.

i Z i Xi
1 ? 5H (5T )
2 ? 9H (1T )
3 ? 8H (2T )
4 ? 4H (6T )
5 ? 7H (3T )
69.

iv. Pentru convenienţă, ı̂n locul variabilelor ,,neobservabile“ Zi pentru

i = 1, . . . , 5 vom considera variabilele-indicator (de asemenea neobservabile)
Zi,A , Zi,B ∈ {0, 1}, cu Zi,A = 1 iff Zi,B = 0 şi Zi,B = 1 iff Zi,A = 0. Evident,
ı̂ntrucât variabilele Zi sunt aleatoare, rezultă că şi variabilele Zi,A şi Zi,B
sunt aleatoare.
Folosind teorema lui Bayes, calculaţi mediile variabilelor neobservabile
Zi,A şi Zi,B condiţionate de variabilele observabile Xi . Veţi considera că
parametrii acestor distribuţii Bernoulli care modelează aruncarea mon-
(0) (0)
edelor A şi B au valorile θA = 0.6 şi respectiv θB = 0.5.
(0) (0) (0) (0)
Aşadar, se cer: E[Zi,A | Xi , θA , θB ] şi E[Zi,B | Xi , θA , θB ] pentru i = 1, . . . , 5.
Ca şi mai ı̂nainte, probabilităţile a priori P (Zi,A = 1) şi P (Zi,B = 1) se vor
considera egale cu 1/2.
70.

Notă
Algoritmul EM ne permite să facem ı̂n mod iterativ estimarea parametrilor
θA şi θB ı̂n funcţie de valorile variabilelor observabile, Xi , şi de valorile
(0) (0)
inţiale atribuite parametrilor (ı̂n cazul nostru, θA = 0.6 şi θB = 0.5).
Vom face o sinteză a calculelor de la prima iteraţie a algoritmului EM —
detaliate la punctele iv, v şi vi de mai jos — sub forma următoare, care
seamănă ı̂ntr-o anumită măsură cu tabelele de la punctul A:

i Zi,A Zi,B Xi E[Zi,A ] E[Zi,B ]

1 − − 5H 0.45H 0.55H
2 − − 9H ⇒
E 0.80H 0.20H M
⇒
3 − − 8H 0.73H 0.27H
4 − − 4H 0.35H 0.65H
5 − − 7H 0.65H 0.35H
71.

Notă (cont.)
Xi · E[Zi,A ] Xi · E[Zi,B ]
2.2H (2.2T ) 2.8H (2.8T )
7.2H (0.8T ) 1.8H (0.2T )
M
5.9H (1.5T ) 2.1H (0.5T )
⇒ 1.4H (2.1T ) 2.6H (3.9T ) ⇒
4.5H (1.9T ) 2.5H (1.1T )
P5 P5
i=1 X i · E[Z i,A ] = 21.3H i=1 Xi · E[Zi,B ] = 11.7H
P5 P5
i=1 (10 − X i ) · E[Z i,A ] = 8.7T i=1 (10 − Xi ) · E[Zi,B ] = 8.3T


 (1) 21.3
 θ̂A =
 ≈ 0.71
21.3 + 8.7
⇒
 (1) 11.7
 θ̂B
 = ≈ 0.58
11.7 + 8.3
72.

Notă (cont.)
Vom arăta că:
• ı̂ntr-adevăr, este posibilă calcularea mediilor variabilelor neobservabile
Zi,A şi Zi,B , condiţionate de variabilele observabile Xi şi ı̂n funcţie de
valorile asignate iniţial (0.6 şi 0.5 ı̂n enunţ, dar ı̂n general ele se pot
asigna ı̂n mod aleatoriu) pentru parametrii θA şi θB ;
• faţă de tabloul de sinteză de la punctul precedent, când toate vari-
abilele erau observabile şi se calculau produsele Xi · Zi,A şi Xi · Zi,B , aici
se ı̂nlocuiesc variabilele Zi,A şi Zi,B cu mediile E[Z
P i,A ] şi E[Zi,B ] ı̂n pro-
dusele respective.
P De fapt, ı̂n loc să se calculeze i Xi ·Zi,A se calculează
media E[ i Xi · Zi,A ], şi similar pentru B.

Notaţie: Pentru simplitate, ı̂n cele de mai sus (inclusiv ı̂n tabelele prece-
dente), prin E[Zi,A ] am notat E[Zi,A | Xi , θ(0) ], iar prin E[Zi,B ] am notat
not. (0) (0)
E[Zi,B | Xi , θ(0) ], unde θ(0) = (θA , θB ).
73.

Răspuns (iv)
Întrucât variabilele Zi,A au valori boolene (0 sau 1), rezultă că

E[Zi,A | Xi , θ(0) ] = 0 · P (Zi,A = 0 | Xi , θ(0) ) + 1 · P (Zi,A = 1 | Xi , θ(0) )

= P (Zi,A = 1 | Xi , θ(0) )

Probabilităţile P (Zi,A = 1 | X, θ(0) ) = P (Zi,A = 1 | Xi , θ(0) ), pentru i = 1, . . . , 5,

se pot calcula folosind teorema lui Bayes:

(0) P (Xi | Zi,A = 1, θ(0) ) · P (Zi,A = 1 | θ(0) )

P (Zi,A = 1 | Xi , θ ) = P (0) (0)
j∈{0,1} P (Xi | Zi,A = j, θ ) · P (Zi,A = j | θ )
(0)
P (Xi | Zi,A = 1, θA )
= (0) (0)
P (Xi | Zi,A = 1, θA ) + P (Xi | Zi,B = 1, θB )

S-a ţinut cont că P (Zi,A = 1 | θ(0) ) = P (Zi,B = 1 | θ(0) ) = 1/2 (a se vedea
enunţul).
74.

De exemplu, pentru i = 1 vom avea:

(0) 0.65 (1 − 0.6)5 1

E[ZA,1 | X1 , θ ] = = 5 ≈ 0.45
0.65 (1 − 0.6)5 + 0.55 (1 − 0.5)5 0.25
1+
0.24

Similar cu E[ZA,1 | X1 , θ(0) ] se calculează şi celelalte medii E[Zi,A | Xi , θ(0) ]

pentru i = 2, . . . , 5 şi E[Zi,B | Xi , θ(0) ] pentru i = 1, . . . , 5.
Am ı̂nregistrat aceste valori/medii ı̂n cel de-al doilea tabel din Nota de
mai sus.

Observaţie: Se poate ţine cont că, de ı̂ndată ce s-a calculat E[Zi,A | Xi , θ(0) ],
se poate obţine imediat şi E[Zi,B | Xi , θ(0) ] = 1 − E[Zi,A | Xi , θ(0) ], fiindcă
Zi,A + Zi,B = 1.
75.

v. Calculaţi media funcţiei de log-verosimilitate a datelor complete, X

(observabile) şi Z (neobservabile):
def.
L2 (θA , θB ) = EP (Z|X,θ(0) ) [log P (X, Z | θ)],
not. not. (0) (0)
unde θ = (θA , θB ) şi θ(0) = (θA , θB ).

Semnificaţia notaţiei de mai sus este următoarea:

funcţia L2 (θA , θB ) este o medie a variabilei aleatoare reprezentată de
log-verosimilitatea datelor complete (observabile şi, respectiv, neobserv-
abile), iar această medie se calculează ı̂n raport cu distribuţia probabilistă
condiţională a datelor neobservabile, P (Z | X, θ(0) ).

Observaţie: La elaborarea calculului, veţi folosi mai ı̂ntâi proprietatea de

liniaritate a mediilor variabilelor aleatoare, şi apoi rezultatele de la punctul
iv.
76.

Răspuns:

Media funcţiei de log-verosimilitate a datelor complete, L2 (θA , θB ), se cal-

culează astfel:

În continuare, omiţând din nou distribuţia probabilistă ı̂n raport cu care
se calculează media aceasta ı̂ntrucât ea poate fi subı̂nţeleasă, vom scrie:
77.

L2 (θA , θB )
5
Y Z Z 1
= E[log ·(θAi,A )Xi · [(1 − θA )Zi,A ]10−Xi · (θBi,B )Xi · [(1 − θB )Zi,B ]10−Xi · ]
i=1
2
5
X
= E[ [Xi · Zi,A · log θA + (10 − Xi ) · Zi,A · log(1 − θA ) +
i=1
5
X
E[ [Xi · Zi,B · log θB + (10 − Xi ) · Zi,B · log(1 − θB ) − log 2] ]
i=1
5
lin. med.
X
= [Xi · E[Zi,A ] · log θA + (10 − Xi ) · E[Zi,A ] · log(1 − θA ) +
i=1
5
X
[Xi · E[Zi,B ] · log θB + (10 − Xi ) · E[Zi,B ] · log(1 − θB ) − log 2]
5 i=1
X X ·E[Zi,A ] X ·E[Zi,B ] 1
= log[θA i · (1 − θA )(10−Xi )·E[Zi,A] · θB i · (1 − θB )(10−Xi )·E[Zi,B ] · ]
i=1
2
1
= log(θA2.2 · (1 − θA )2.2 · θB
2.8
· (1 − θB )2.8 · . . . · θA4.5 · (1 − θA )1.9 · θB
2.5
· (1 − θB )1.1 · ).
2
78.

La ultima egalitate de mai sus, cantităţile fracţionare provin din calculele

simple X1 · E[Z1,A | X1 , θ] ≈ 2.2, X1 · E[Z1,B | X1 , θ] ≈ 2.8, . . ., X5 · E[Z1,A |
X5 , θ] ≈ 4.5, X5 · E[Z1,B | X5 , θ] ≈ 2.5 (a se vedea tabelele din cadrul Notei
precedente).
79.
(1) not. (1) not.
vi. Calculaţi θA = arg maxθA L2 (θA , θB ) şi θB = arg maxθB L2 (θA , θB ).

Răspuns:

Valorile parametrilor θA şi θB pentru care se atinge maximul mediei funcţiei

de log-verosimilitate a datelor complete se obţin cu ajutorul derivatelor
parţiale de ordinul ı̂ntâi:a
∂L2 (θA , θB )
=0
∂θA
∂
⇒ (2.2 log θA + 2.2 log(1 − θA ) + . . . + 4.5 log θA + 1.9 log(1 − θA )) = 0
∂θA
2.2 2.2 4.5 1.9 (1)
⇒ − + ...+ − = 0 ⇒ . . . ⇒ θA ≈ 0.71.
θA 1 − θA θA 1 − θA
(1)
Similar, vom obţine θB ≈ 0.58.
a Este imediat că derivatele de ordinul al doilea au valori negative pe tot domeniul de definiţie.
80.

C. Formalizaţi paşii E şi M ai algoritmului EM pentru estimarea

parametrilor θA şi θB ı̂n condiţiile de la punctul B.

Răspuns:
Formulele care se folosesc ı̂n cadrul algoritmului EM pentru rezolvarea
problemei date — i.e. estimarea parametrilor θA şi θB atunci când vari-
abilele Zi sunt neobservabile —, se elaborează/deduc astfel:
81.

Pasul E:
E[Zi,A | X, θ] = P (Zi,A = 1 | X, θ) = P (Zi,A = 1 | Xi , θ)
1/2
z }| {
T. B. P (Xi | Zi,A = 1; θ) · P (Zi,A = 1 | θ)
=
P (Xi | Zi,A = 1; θ) · P (Zi,A = 1 | θ) + P (Xi | Zi,B = 1; θ) · P (Zi,B = 1 | θ)
| {z }
1/2

P (Xi | Zi,A = 1; θ)
=
P (Xi | Zi,A = 1; θ) + P (Xi | Zi,B = 1; θ)
θAXi (1 − θA )10−Xi
=
θAXi (1 − θA )10−Xi + θBXi
(1 − θB )10−Xi

Similar, vom obţine:

Xi
θB (1 − θB )10−Xi
E[Zi,B | X, θ] = Xi Xi
θA (1 − θA )10−Xi + θB (1 − θB )10−Xi
82.

Notând cu
− xi valoarea variabilei Xi ,
(t) (t)
− θA şi respectiv θB estimările parametrilor θA şi θB la iteraţia t a algo-
ritmului EM,
(t+1) (t+1) (t) (t)
− pi,A şi respectiv pi,B , mediile E[Zi,A | Xi , θA ] şi E[Zi,B | Xi , θB ],
vom avea:
(t) (t)
(t+1) (θA )xi (10 − θA )10−xi
pi,A = (t) (t) (t) (t)
(θA )xi (10 − θA )10−xi + (θB )xi (10 − θB )10−xi
(t) (t)
(t+1) (θB )xi (10 − θB )10−xi
pi,B = (t) (t) (t) (t)
(θA )xi (10 − θA )10−xi + (θB )xi (10 − θB )10−xi
83.

Pasul M:
Ca şi mai ı̂nainte, ı̂n formulele de mai jos vom folosi notaţiile simplificate
not. not.
E[Zi,A ] = E[Zi,A | Xi , θ(t) ] şi E[Zi,B ] = E[Zi,B | Xi , θ(t) ].
Cu aceste notaţii, procedând similar cu calculul de la partea B, punctul v,
vom avea:
5
Y x E[Zi,A ] x E[Zi,B ]
L2 (θA , θB ) = log θAi (1 − θA )(10−xi )E[Zi,A] θBi (1 − θB )(10−xi )E[Zi,B ]
i=1
84.
Prin urmare,
5 5
∂ 1 X 1 X
L2 (θA , θB ) = 0 ⇒ xi E[Zi,A ] = (10 − xi )E[Zi,A ]
∂θA θA i=1 1 − θA i=1
5
X 5
X
⇒ (1 − θA ) xi E[Zi,A ] = θA (10 − xi )E[Zi,A ]
i=1 i=1
5
X 5
X
⇒ xi E[Zi,A ] = 10 θA E[Zi,A ]
i=1 i=1
P5 P5
xi E[Z i,A ] i=1 xi E[Zi,B ]
⇒ θA = i=1
P şi, similar, θB = P
10 5i=1 E[Zi,A ] 10 5i=1 E[Zi,B ]
Aşadar, la pasul M al algoritmului EM vom avea:
P5 (t+1) P5 (t+1)
(t+1) i=1 xi pi,A (t+1) i=1 xi pi,B
θA = P5 (t+1) şi θB = P5 (t+1)
10 i=1 pi,A 10 i=1 pi,B
85.

Observaţie

Implementând algoritmul EM cu relaţiile obţinute pentru pasul E şi pasul

(10) (10)
M, după execuţia a 10 iteraţii se vor obţine valorile θA ≈ 0.80 şi θB ≈ 0.52.

Este interesant de observat că estimarea obţinută pentru parametrul θA

este acum la acelaşi nivel cu cea obţinută prin metoda verosimilităţii
maxime (MLE) ı̂n cazul observării tuturor variabilelor (0.80, vezi rezolvarea
de la partea A, punctul i), iar estimarea obţinută pentru parametrul θB a
coborât de la valorea 0.58 care a fost obţinută la prima iteraţie a algoritmu-
lui EM la o valoare (0.52) care este considerabil mai apropiată de estimarea
prin metoda MLE (0.45).
86.
87.

The EM algorithm for solving

a mixture of K categorical distributions

CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW6, pr. 1

88.
Mixture models are helpful for modelling unknown subpopulations in data.
If we have a collection of data points X = {X1 , . . . , Xn }, where each Xi is [in-
dependently] drawn from one of K possible distributions, we can introduce a
discrete-valued random variable Zi ∈ {1, . . . , K} that indicates which distribu-
tion Xi is drawn from.
This exercise deals with the categorical mixture model, where each observation
Xi is a discrete value drawn from a categorical distribution. The parameter for
a categorical distribution is a K-dimensional
P vector π that lists the probability
of each of K possible values (therefore, k πk = 1).
For example, suppose our categorical mixture model has 3 underlying dis-
tributions. Then, each Zi could take on one of three values, {1, 2, 3}, with
respective probabilities π1 , π2 , π3 ; equivalently, Zi ∼ Categorical(π), where π =
(π1 , π2 , π3 ) ∈ R3+ , with π1 + π2 + π3 = 1. The observation Xi is then generated
from another categorical distribution, depending on the value of Zi .
The generative process for a categorical mixture model is summarized as fol-
lows:

Zi ∼ Categorical(π)
Xi ∼ Categorical(θZi )
89.

For this model, where we observe X but not Z, we want to learn the parame-
ters of the K categorical components Θ = {π, θ1 , . . . , θK }, where each θk ∈ RM is
the parameter for the categorical distribution associated with the k-th mix-
ture component. (It implies that each Xi can take on one of M possible
values). We will use the EM algorithm to accomplish this.

A note on notation and a hint:

When working with categorical distributionsit is helpful to use indicator func-
tions. The indicator function 1{x=j} has the value 1 when x = j and 0 otherwise.
With this notation, we can express the probability that a random variable
drawn from a categorical distribution (e.g., Y ∼ Categorical(φ), where φ ∈ RN )
takes on a particular value as
N
Y 1
P (Y ) = φi {Y =i} .
i=1
90.

a. What is the joint distribution P (X, Z; Θ)?

b. What is the posterior distribution of the latent variables, P (Z|X; Θ)?
c. Compute the expectation of the log-likelihood,

Q(Θ|Θ′ ) = EZ|X;Θ′ [log P (X, Z; Θ′ )]

d. What is the update step for Θ? That is, what Θ maximizes Q(Θ|Θ′ )?
Hint: Make sure your solution enforces the constraint that parameters to
categorical distributions must sum to 1. Lagrange multipliers are a great way
to solve constrained optimization problems.
91.

Answer:
a.
n
Y n
Y
P (X, Z; Θ) = P (Xi , Zi ; Θ) = P (Xi |Zi ; Θ)P (Zi ; Θ)
i=1 i=1
 1{Z
K
n Y M i =k}
Y Y 1{X =v }
= πk θkj i j 
i=1 k=1 j=1

b.
P (Xi |Zi = k; Θ)P (Zi = k; Θ)
Bayes F.
P (Zi = k|Xi ; Θ) =
P (Xi ; Θ)

QK QM 1{Xi =vj } 1{Zi =k}
k=1 πk j=1 θkj
P (Xi |Zi = k; Θ)P (Zi = k; Θ)
= PK = PK QM 1{Xi =vj }
l=1 P (Zi = l; Θ)P (Xi |Zi = l; Θ) l=1 πl j=1 θlj
92.
c.
def.
Q(Θ|Θ′ ) = EZ|X;Θ′ [log P (X, Z; Θ)]
  1{Z =k} 
n K M i
a.  Y Y  Y 1{Xi =vj }  
= EZ|X;Θ′ log πk θkj 
i=1 k=1 j=1
  
X K
n X M
X
= EZ|X;Θ′  1{Zi =k} log πk + 1{Xi =vj } log θkj 
i=1 k=1 j=1
 
X K
n X M
X
= EZ|X;Θ′ [1{Zi =k} ] log πk + 1{Xi =vj } log θkj 
i=1 k=1
| {z } j=1
P (Zi =k|Xi ;Θ)

Using
QM 1
not. ′ b.
′
πk′
j=1 (θkj )
{Xi =vj }
γik = EZ|X;Θ′ [1{Zi =k} ] = P (Zi = k|Xi ; Θ ) = PK ,
′ )1{Xi =vj }
′
QM
π
l=1 l (θ
j=1 lj

leads to  
X K
n X M
X
Q(Θ|Θ′ ) = γik log πk + 1{Xi =vj } log θkj 
i=1 k=1 j=1
93.

d. To find the updates rules for the parameters, we solve the following constrained
optimization problem:

maxπ,θ1:K Q(Θ|Θ′ )
PK
subject to k=1 πk = 1
PM
j=1 θkj = 1, ∀k = 1, . . . , K

We introduce a Lagrange multiplier for each constraint associated to this problem and
form the Lagrangian functional :
!  
K
X K
X M
X
L = Q(Θ|Θ′ ) + λπ 1 − πk + λθk 1 − θkj 
k=1 k=1 j=1

Now we will differentiate the Lagrangian functional with respect to each parameter we
want to optimize, set the derivative to zero, and solve for the optimal value.
94.
We’ll start with optimizing for π, the parameter to the categorical distribution for the
latent variables Z.
n n
∂L X 1 ∗ 1 X
=0⇔ γik · − λπ = 0 ⇔ πk = γik
∂πk i=1
πk λ π i=1

We can find the value of the Lagrange multiplier λπ by enforcing the constraint:
K K n K n
X X 1 X 1 XX
πk∗ =1⇔ γik = 1 ⇔ γik = 1 ⇔
λπ i=1 λπ i=1
k=1 k=1 k=1
X n
K X n X
X K
λπ = γik ⇔ λπ = γik ⇔ λπ = n
k=1 i=1 i=1 k=1
| {z }
1

Substituting this value for λπ back into the expression of πk∗ gives the answer:
n
1X
πk∗ = γik
n i=1
95.
The process for optimizing the observation parameters θkj is similar. First, we differ-
entiate the Lagrangian L with respect to θkj , set to zero, and [finally] solve.
n
∂L X 1
=0⇔ γik 1{Xi =vj } · − λθk = 0
∂θkj i=1
θ kj
n
∗ 1 X
⇒ θkj = γik 1{Xi =vj }
λθk i=1

Again, we can find the value of the Lagrange multiplier λθk by enforcing the summation
constraint:
M M n M n
X
∗
X 1 X 1 XX
θkj =1⇔ γik 1{Xi =vj } = 1 ⇔ γik 1{Xi =vj } = 1
j=1
λ
j=1 θk i=1
λθk j=1 i=1

M X
X n
⇔ λθk = γik 1{Xi =vj }
j=1 i=1

∗
Substituting this value for λθk back into the expression of θkj gives the answer:
Pn Pn
∗ γ 1
i=1 ik {Xi =vj } i=1 γik 1{Xi =vj }
θkj = PM Pn = Pn PM
l=1 i=1 γ ik 1 {Xi =vl } i=1 l=1 γik 1{Xi =vl }
96.

The EM algorithm for solving

a mixture of K categorical distributions,
applied to the problem of Word Sense Disambiguation,
i.e. identifying the semantic domains associated to words in a
text document

CMU, 2012 fall, Eric Xing, Aarti Singh, HW3, pr. 3

97.
The objective of this exercise is to derive the update equations of the
EM algorithm for optimizing the latent variables [designating the semantic
domains] involved in generating a text document.
Each word will be seen as a random variable w that can take values 1, . . . , V
from the vocabulary of words. In fact, we will denote each w by an array of V
components such that w(i) = 1 if w takes the value of the i-th word in the vocabulary.
PV
Hence, i=1 w(i) = 1.
Given a document containing words wj , j = 1, . . . , N, where N is the length
of the document, we will assume that these words are generated from a
mixture of K discrete topics:
K
X V
Y
P (w) = πm P (w|µm) and P (w|µm) = µm (i)w(i) ,
m=1 i=1

where
πm denotes the prior [probability] for the latent topic variable t = m,
not. PV
µk = (µk (1), . . . , µk (i), . . . , µk (V )), with µk (i) ≥ 0 for i = 1, . . . , V and i=1 µm (i) = 1 for
each k = 1, . . . , K, and
not.
µm (i) = P (w(i) = 1|t = m).
98.

a. In the expectation step, for each word wj , compute

not.
Fj (t) = P (t|wj ; θ),
the probability that wj belongs to each of the K topics, where θ is
the set of parameters of this mixture model.

b. In the maximization step, compute θ that maximizes the log-likelihood

of the data
N
Y
not.
l(w; θ) = log P (wj ; θ)
j=1

Hint: Suming over the latent topic variable, we can write l(w; θ) as
N N P
X X X
t P (wj , t; θ)
l(w; θ) = log P (wj , t; θ) = log Fj (t)
j=1 t j=1
Fj (t)

Further on, using Jensen’s inequality (see pr. EM-2) we get:

N X N
X P (wj , t; θ) X X
l(w; θ) ≥ Fj (t) log = Fj (t) log P (wj , t; θ) + H(Fj )
j=1 t
Fj (t) j=1 t

Hence compute θ as:

N X
X
argmax Fj (t) log P (wj , t; θ)
θ j=1 t
100.

Answer:
N X
X K
θ = argmax Fj (tj = m) log P (wj , tj = m|θ)
θ
j=1 m=1
N X
X K
= argmax Fj (tj = m) log P (wj |tj = m; θ) P (tj = m|θ)
θ j=1 m=1
K
N X V
!
X Y
= argmax Fj (tj = m) log πm µm (l)wj (l)
θ j=1 m=1 l=1

X K
N X V
X
= argmax [Fj (tj = m) log πm + Fj (tj = m) log µm (l)wj (l) ]
θ j=1 m=1 l=1

X K
N X V
X
= argmax [Fj (tj = m) log πm + Fj (tj = m) wj (l) log µm (l)] (3)
θ
j=1 m=1 l=1
To optimize µm (l): 101.

After eliminating from (3) the terms which are constant with respect to µm we get:
N
X V
X
Fj (tj = m) wj (l) log µm (l)
j=1 l=1

We will use a Lagrangian to constrain µm to be a probability distribution:

N V V
!
X X X
L(µm (l)) = Fj (tj = m) wj (l) log µm (l) + β µm (l) − 1
j=1 l=1 l=1

Then, solving for µm (l):

N
∂ X wj (l)
L(µm (l)) = 0 ⇔ Fj (tj = m) +β =0
∂µm (l) j=1
µm (l)
N
1 X 1 −β
⇔ Fj (tj = m) wj (l) + β = 0 ⇔ = PN
µm (l) j=1 µm (l) j=1 Fj (tj = m) wj (l)
PN
j=1 Fj (tj = m) wj (l)
⇔ µm (l) = (4)
−β
PV 102.
Knowing that l=1 µm (l) = 1, we have:
V PN V X N
X j=1 Fj (tj = m) w j (l) X
= 1 ⇔ −β = Fj (tj = m) wj (l)
l=1
−β l=1 j=1

Hence, substituting for −β in (4), we get:

PN PN
F
j=1 j j(t = m) w j (l) j=1 Fj (tj = m) wj (l)
µm (l) = PV PN = PN PV
l=1 j=1 Fj (tj = m) wj (l) j=1 l=1 Fj (tj = m) wj (l)
PN PN
F
j=1 j j (t = m) w j (l) j=1 Fj (tj = m) wj (l)
= V
= PN
j=1 Fj (tj = m)
PN X
j=1 Fj (tj = m) wj (l)
l=1
| {z }
1

Note: Intuitively the last expression can be interpreted as the portion

[of word[ occurrence]s] that had w(l) = 1 among all words [in the given
document] which are deemed to belong to cluster m.
103.
To optimize πm , we proceed similarly:
We begin by removing from (3) the terms that are constant with respect
to πm , thus getting:
N
X
Fj (tj = m) log πm
j=1
PK
So, using the Lagrangian with the constraint that m=1 πm = 1
N K
!
X X
L(πm ) = Fj (tj = m) log πm + β πm − 1 ,
j=1 m=1

and solving for πm , we have:

N N
∂ X Fj (tj = m) 1 X
L(πm ) = 0 ⇔ +β =0⇔ Fj (tj = m) = −β
∂πm j=1
π m πm j=1
PN
j=1 Fj (tj = m)
⇔ πm = (5)
−β
104.
PK
Since m=1 πm = 1, it gives us:
K PN K N
j=1 Fj (tj = m) 1 XX
X
=1 ⇔ Fj (tj = m) = 1
m=1
−β −β m=1 j=1
K X
X N
⇔ −β = Fj (tj = m)
m=1 j=1

Substituting for −β in (5), we get:

PN PN PN
F (t
j=1 j j = m) F (t
j=1 j j = m) j=1 Fj (tj = m)
πm = PK PN = =
m=1 j=1 Fj (tj = m) PN
K
X N
j=1 Fj (tj = m)
m=1
| {z }
1

Note: Intuitively the last expression can be interpreted as the portion [of
word[ occurrence]s] that belongs to cluster m among the total of N words.
105.

To summarize:
E step: QV
πmµm (l)wj (l)
l=1
Fj (tj = m) = PK QV for j = 1, . . . , N and m = 1, . . . , K
w (l)
m′ =1 πm′ l=1 µm′ (l)
j

M step: PN
j=1 Fj (tj = m) wj (l)
µm (l) = PN for m = 1, . . . , K and l = 1, . . . , V
j=1 Fj (tj = m)
PN
j=1 Fj (tj = m)
πm = for m = 1, . . . , K
N
106.

The EM algorithm for

solving mixtures of Poisson distributions
University of Utah, 2008 fall, Hal Daumé III, HW9, pr. 1
107.

0.4
The Poisson distribution is a distribu- λ=1
tion over positive count values; for a λ=4

0.3
λ = 10
count x with parameter λ, the Poisson
1 λx
has the form Poisson(x|λ) = λ · .

p(x)

0.2
e x!
It can be proven that the maximum

0.1
likelihood estimation for λ given a se-
quence of counts x1 , . . . , xn was simply

0.0
1 Pn
xi – the mean of the counts. 0 5 10 15 20
n i=1
x

Let’s consider a generalization of this: the Poisson mixture model. This is

actually used in web server monitoring. The number of accesses to a web
server in a minute typically follows a Poisson distribution.
108.

Suppose we have n web servers we are monitoring and we monitor

each for M minutes. Thus, we have n × M counts; call xi,m the
number of hits to web server i in minute m. Our goal is to cluster
the web servers according to their hit frequency.

Using the EM algorithm, construct a Poisson mixture model for

this problem and compute the expectations and maximization
steps for this model.
109.

Hint: Suppose we want K clusters; let zi be the latent variable

telling us which cluster the web server i belongs to (from 1 to
K). Let λk denote the parameter for the Poisson distribution for
cluster k, and πk the . Then, the complete data likelihood is:
n
Y n
Y
def. i.i.d.
L(λ̄, π̄) = p(x̄, z̄|λ̄, π̄) = p(xi , zi |λ̄, π̄) = p(xi |zi , λ̄, π̄) · p(zi |λ̄, π̄)
i=1 i=1 k
| {z }
πk
K
n Y
" M
#1{zi =k}
Y Y
= πk Poisson(xi,m |λk )
i=1 k=1 m=1

where
not. not.
x̄ = (x1 , . . . , xn ), with xi = (xi,1 , . . . , xi,M ) for i = 1, . . . , n,
not. not. not.
z̄ = (z1 , . . . , zK ), π̄ = (π1 , . . . , πK ); λ̄ = (λ1 , . . . , λK ), and
1{zi =k} is one if zi = k and zero otherwise.
110.
Solution
The likelihood function:
K
n Y
" M x
#1{zi =k}
Y Y 1 λk i,m
L(λ̄, π̄) = πk λk x
i=1 k=1 m=1
e i,m !

The likelihood function:

K
n X
" M x
#
def.
X X 1 λk i,m
ℓ(λ̄, π̄) = ln L(λ̄, π̄) = 1{zi =k} ln πk + ln
i=1 k=1 m=1
eλk xi,m !
K
n X
" M
#
X X
= 1{zi =k} ln πk + (−λk + xi,m ln λk − ln(xi,m )!)
i=1 k=1 m=1

The auxiliary function:

def.
Q(λ̄, π̄|λ̄(t) , π̄ (t) ) = EP (zi =k|xi,λ̄(t) ,π̄(t) ) [ℓ(λ̄, π̄)]
n X K
" M
#
X X
= P (zi = k|xi , λ̄(t) , π̄ (t) ) ln πk + (−λk + xi,m ln λk − ln(xi,m )!)
i=1 k=1 m=1
111.

E-step:
(t) not.
pik = P (zi = k|xi , λ̄(t) , π̄ (t) )

Bayes F. P (xi |zi = k, λ̄(t) , π̄ (t) ) · P (zi = k|λ̄(t) , π̄ (t) )

M-step:
PK
Since k=1 πk = 1, we have to introduce the Lagrange multiplier λ:
" K
#
∂ X
Q(λ̄, π̄|λ̄(t) , π̄ (t) ) + λ(1 − πk ) = 0 ⇔
∂πk
k=1
n (t) n
X pik (t+1) 1 X (t)
− λ = 0 ⇔ πk = p
i=1
πk λ i=1 ik

(t+1)
It is easy to see that πk is indeed a maximum point for Q w.r.t. the pa-
rameter πk .
PK (t+1)
Imposing the constraint k=1 πk = 1 leads to:
K n n K
1 X X (t) 1 X X (t) n
pik = 1 ⇔ pik = 1 ⇔ = 1 ⇔ λ = n.
λ λ i=1 λ
k=1 i=1 k=1
| {z }
1

(t+1) 1 Pn (t)
Therefore, πk = i=1 pik , which is alyways a non-negative quantity.
n
113.

Now, regarding the parameters λk :

n M !
∂ X X 1
Q(λ̄, π̄|λ̄(t) , π̄ (t) ) = 0 ⇔ pik −1 + xi,m · =0⇔
∂λk i=1 m=1
λ k

n M
! n M
! n
X 1 X 1 X X X
pik −M + xi,m =0⇔ pik xi,m =M pik ⇔
i=1
λk m=1 λk i=1 m=1 i=1

Pn P
M
i=1 pik m=1 xi,m
(t+1)
λk = Pn
M i=1 pik
(t+1)
It can be easily shown that for each k ∈ {1, . . . , K}, the value λk
is positive, and it maximizes Q w.r.t. the parameter λk .
114.

The EM algorithm for solving

a Gamma Mixture Model
G. Vegas-Sánchez-Ferrero, M. Martı́n-Fernández, J. Miguel Sanches
A Gamma Mixture Model for IVUS Imaging, 2014
[ adapted by Liviu Ciortuz ]
115.
The Gamma Mixture Model
K
X
p(x|θ) = πk Gamma(x|θk ),
k=1

not. not.
with r = (r1 , . . . , rK ), β = (β1 , . . . , βK ), θ = (r, β), and θk = (rk , βk ) for k = 1, . . . , K,
PK
and πk > 0 for k = 1, . . . , K and k=1 πk = 1.

Consider x1 , . . . , xn ∈ R+ , each xi being generated by one component of the above mix-

ture, denoted by zi ∈ {1, . . . , K}.
Find the maximum likelihood estimation of the parameters θ by using the EM algo-
rithm.
Solution
not. not.
• The verosimility of the “complete” data (x, z), with x = (x1 , . . . , xn ), and z =
(z1 , . . . , zn ), can be written as follows:
n
Y Y K
n Y n Y
Y K
i.i.d. 1{zi =k}
p(x, z|θ) = p(xi |zi , θ) = [p(xi |zi , θ) · p(zi |θ)] = [ p(xi |zi , θ) · πk ]1{zi =k}
| {z } | {z }
i=1 i=1 k=1 πk i=1 k=1
Gamma(xi |θk )
where 1{zi =k} is the indicator function.
116.

• The log-verosimility function:

n X
X K
def.
ℓ(θ) = ln p(x, z|θ) = 1{zi =k} [ln πk + ln Gamma(xi |θk )]
i=1 k=1
K
n X
θk =(rk ,βk ) X xi
= 1{zi =k} ln πk − rk ln βk − ln Γ(rk ) + (rk − 1) ln xi −
i=1 k=1
βk

• The auxiliary function:

" n K #
θ (t)
=(r (t)
,β (t)
) XX xi
Q(θ|θ(t) ) = EP (Z|X,θ(t) ) 1{zi =k} ln πk − rk ln βk − ln Γ(rk ) + (rk − 1) ln xi −
i=1 k=1
βk
K
n X
lin. of exp. X xi
= P (zi = k|xi , θ(t) ) ln πk − rk ln βk − ln Γ(rk ) + (rk − 1) ln xi −
i=1 k=1
| {z } βk
not.: γik
K
n X
X xi
= γik ln πk − rk ln βk − ln Γ(rk ) + (rk − 1) ln xi −
i=1 k=1
βk
117.

E step

π
z }|k {
(t+1) not. Bayes F. Gamma(xi |zi = k, θ) · P (zi = k|θ)
γik = P (zi = k|xi , θ(t) ) = PK
k′ =1 Gamma(xi |zi = k ′ , θ) · P (zi = k ′ |θ)
| {z }
πk ′

Gamma(xi |θk ) · πk
= PK
k′ =1 Gamma(xi |θk′ ) · πk′
(t+1)
Note that γik > 0 for k = 1, . . . , K because xi > 0 for i = 1, . . . , n.
118.
M step: πk
PK
Because k=1 πk = 1, we will use a Lagrange multiplier λ, and solve as usually:
" K
!# n n
∂ X X1 1 X
Q(θ|θ(t) ) + λ 1 − πk =0⇔ γik −λ=0⇔ γik = λ ⇔
∂πk i=1
π k πk i=1
k=1

n
(t+1) 1X
πk = γik
λ i=1
PK
By substituting this expression into the constraint k=1 πk = 1, we will get:
K n n K
1 XX 1 XX 1
γik = 1 ⇔ γik = 1 ⇔ n = 1 ⇔ λ = n
λ λ i=1 λ
k=1 i=1 k=1
| {z }
1

Therefore,
n
(t+1) 1X
πk = γik k (t+1)
n i=1
(t+1)
Note that πk > 0 for k = 1, . . . , K.
119.

M step: βk
n " n #
∂ X rk xi 1 X
Q(θ|θ(t) ) = 0 ⇔ γik − + 2 = 0 ⇔ 2 γik (xi − rk βk ) = 0
∂βk i=1
βk βk βk i=1
n
X n
X n
X n
X n
X
⇔ γik (xi − rk βk ) = 0 ⇔ γik rk βk = γik xi ⇔ rk βk γik = γik xi
i=1 i=1 i=1 i=1 i=1

Therefore,
Pn (t+1) (t+1)
(t+1) γ xi
βk = Pnik
i=1
for k = 1, . . . , K.
rk i=1 γik
(t+1)
It can be easily shown that βk is always positive, and it maxi-
mizes Q(θ|θ(t) ) w.r.t. βk .
120.
M step: rk

(t+1)
By substituting [the expression of] βk into Q(θ|θ(t) ), we will get:
n K
" #
(t+1) (t)
XX xi
Q(r, β |θ ) = γik ln πk − rk ln β (t+1) − ln Γ(rk ) + (rk − 1) ln xi − (t+1)
i=1 k=1 βk

Then, solving for the partial derivative of Q(r, β (t+1) |θ(t) ) w.r.t. rk ,
amounts to:
∂
Q(r, β (t+1) |θ(t) ) = 0 ⇔
∂rk
n
" #
∂ X (t+1) (t+1) xi
γik ln πk − rk ln βk − ln Γ(rk ) + (rk − 1) ln xi − (t+1)
=0⇔
∂rk i=1 βk
n
" #
′
X (t+1) (t+1) ∂ (t+1) Γ (rk ) ∂ 1
γik − ln βk − rk ln βk − + ln xi − xi =0
i=1
∂rk Γ(rk ) ∂rk β (t+1)
k
121.

Since
Pn (t+1) Pn (t+1)
!
∂ (t+1) ∂ γ
i=1 ik xi ∂ γ
i=1 ik xi 1
ln βk = ln Pn (t+1)
= − ln rk + ln Pn (t+1)
=−
∂rk ∂rk rk i=1 γik ∂rk rk
i=1 γik

and
Pn (t+1) Pn (t+1)
∂ 1 ∂ rk i=1 γik i=1 γik
= P = Pn ,
∂rk β (t+1) ∂rk n γ (t+1) xi γ
(t+1)
xi
k i=1 ik i=1 ik

it follows that
∂
Q(r, β (t+1) |θ(t) ) = 0 ⇔
∂rk
n
" Pn (t+1) Pn (t+1)
#
′
i′ =1 γi′ k xi′ Γ (rk ) i′ =1 γi′ k
X (t+1)
γik ln rk − ln Pn (t+1)
+1− + ln xi − xi Pn (t+1)
=0
Γ(rk )
i=1 i′ =1 γi′ k i′ =1 γi′ k xi′
122.
Therefore,
n
Γ′ (rk ) X (t+1)
ln rk − γ =
Γ(rk ) i=1 ik
n Pn (t+1) n n n Pn (t+1)
X (t+1) i′ =1 γi′ k xi′ X (t+1)
X (t+1)
X (t+1) i′ =1 γi′ k
γik ln Pn (t+1)
− γik − γik ln xi + γik xi · Pn (t+1)
⇔
i=1 i′ =1 γi′ k i=1 i=1 i=1 i′ =1 γi′ k xi′
n n Pn (t+1) n
Γ′ (rk ) X (t+1)
X (t+1) i′ =1 γi′ k xi′ X (t+1)
ln rk − γik = γik ln Pn (t+1)
− γik ln xi
Γ(rk ) i=1 i=1 i′ =1 γi′ k i=1

So,
Pn (t+1) Pn (t+1)
Γ′ (rk ) γ
i=1 ik xi i=1 γik ln xi
ln rik − = ln Pn (t+1)
− Pn (t+1)
Γ(rk ) γ γ
i=1 ik i=1 ik

Unfortunately, there is no nkown close form of rk satisfying this equation.

Γ′ (rk )
However, it can be shown that the function ln rik − is well-behaved,
Γ(rk )
(t+1)
yielding a unique solution to the above equation, which will be denoted rik ;
it can be found through certain numerical methods.

Γ′ (rk )not
Note: The function ψ(r) = ln rik − is the so-called di-gamma function.
Γ(rk )
123.

Note

It is interesting to see that

if in the update equations of this EM algorithm for solving the
1
Gamma Mixture Model we substitute r = rk and γk = for k =
k
1, . . . , K, we obtain exactly the Maximum Likeliood estimations for
the parameters r and β of the Gamma distribution.
124.

Using the EM algorithm for

estimating the selection probability for a mixture
of two (arbitrary) distributions
CMU, 2006 spring, Carlos Guestrin, final exam, pr. 8
CMU, 2004 fall, Carlos Guestrin, HW2, pr. 2.1
125.

We want to derive an EM algorithm for estimating the mixing parameter

for a mixture of arbitrary probability densities f1 and f2 .
For example, f1 (x) could be a standard normal distribution centered at
0, and f2 (x) could be the uniform distribution between [0, 1]. You can
think about such mixtures in the following way: First, you flip a coin.
With probability λ (i.e., the coin comes up heads), you will sample x from
density f1 , and with probability (1 − λ) you sample from density f2 .
More formally, let fλ (x) = λf1 (x) + (1 − λ)f2 (x), where f1 and f2 are arbitrary
probability density functions on R, and λ ∈ [0, 1] is an unknown mixture
parameter.
126.

a.
Given a data point x, and a value for the mixture parameter λ,
compute the probability that x was generated from density f1 .
b.
Now, suppose you are given a data set {x1 , . . . , xn } drawn i.i.d. from the
mixture density, and a set of coin flips {z1 , z2 , . . . , zn }, such that zi = 1 means
that xi is a sample from f1 , and zi = 0 means that xi was generated from
density f2 .
For a fixed parameter λ, compute the complete log-likelihood of the data,
i.e., ln P (x1 , z1 , x2 , z2 , . . . , xn , zn |λ).
c.
Now, suppose you are given only a sample {x1 , . . . , xn } drawn i.i.d. from
the mixture density, without the knowledge about which component the
samples were drawn from (i.e., the zi are unknown).
Using your derivations from part a and b, derive the E- and M-steps for
an EM algorithm to compute the maximum likelihood estimate (MLE) of
the mixture parameter λ.
127.
Solution
P (X = x|Z = 1) · P (Z = 1) λf1 (x)
a. P (Z = 1|X = x) = = .
P (X = x) fλ (x)
Qn Qn
b. P (x1 , z1 , x2 , z2 , . . . , xn , zn |λ) = i=1 P (xi , zi |λ) = i=1 (P (xi |zi , λ) · P (zi |λ)).
(
λf1 (xi ) if zi = 1
P (xi |zi , λ) · P (zi |λ) =
(1 − λ)f2 (xi ) if zi = 0
= λzi f1 (xi )zi (1 − λ)1−zi f2 (xi )1−zi

Therefore,
n
X
ln P (x1 , z1 , x2 , z2 , . . . , xn , zn |λ) = ln P (xi , zi |λ)
i=1
Xn

= ln λzi f1 (xi )zi (1 − λ)1−zi f2 (xi )1−zi
i=1
Xn
= zi ((ln λ + ln f1 (xi )) + (1 − zi )(ln(1 − λ) + ln f2 (xi )))
i=1
128.
c.
not. (t) [Link]. λ(t) f1 (xi )
E-step: q(zi ) = P (zi = 1|xi , λ ) =
fλ(t) (xi )
M-step:
n
Y
(t+1)
λ = argmax Eq [ln P (x1 , z1 , x2 , z2 , . . . , xn , zn |λ)]
λ i=1
Xn n
X
= argmax(ln λ · q(zi ) + ln(1 − λ) · (1 − q(zi )))
λ i=1 i=1
| {z } | {z }
c n−c

∂ c n−c
(c ln λ + (n − c) ln(1 − λ)) = 0 ⇔ =
∂λ λ 1−λ
Pn
c q(zi )
⇔ c(1 − λ) = λ(n − c) ⇔ c = nλ ⇔ λ = = i=1
n n
n
(t+1) λ(t) X f1 (xi )
⇒ λ =
n i=1 fλ(t) (xi )

∂2 c 1
Note: 2
(c ln λ + (n − c) ln(1 − λ)) = − − (n − c) 2
≤ 0 ⇒ λ(t+1) corresponds to a
∂λ λ (1 − λ)
maximum.
129.

Note

The EM algorithm can [be naturally extened so as to] handle mix-

tures of an arbritrary (although fixed) number of probabilistic
distributions.
130.

Exemplification [for a mixture of 3 densities]

Source: Brani Vidakovic, EM Algorithm and Mixtures [Handout 12]

A sample of size 150 is generated from

the mixture

f (x) = 0.5N (−5, 22)+0.3N (0, 0.52)+0.2N (2, 1),

where N denotes the normal distribu-

tion.
The mixing weights are estimated by the
EM algorithm.
M = 20 iterations of the EM algorithm
yielded the weights (0.4977, 0.2732 and
0.2290.
The nearby figure shows the histogram
of data, the theoretical mixture (dotted
line) and the EM estimate (solid line).
131.

Deriving the EM algorithm for

estimating the parameters of two random variables
of exponential distribution,
given a set of instances generated by their sum

CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW2, pr. 2.2

132.

Suppose that Z1 ∼ exp(1/λ1 ) and Z2 ∼ exp(1/λ2), and λ1 6= λ2 . Z1 and Z2 are

independent. Let X = Z1 + Z2 denote the sum of Z1 and Z2 . Suppose that
{x1 , x2 , . . . , xn } are i.i.d. samples from the distribution of X.

a. Derive an expression for the density of X in terms of λ1 and λ2 .

Hint 1: The density of Z1 is fλ1 (z) = λ1 · e−λ1 z for z ≥ 0, and 0 otherwise. The
density of Z2 is defined in a similar way.
Hint 2: You could first derive c.d.f. (cumulative distribution function) of X,
R x R x−z
which is defined as F (x) = P (Z1 + Z2 < x) = 0 0 1 fλ1 (z1 ) · fλ2 (z2 ) dz2 dz1 .

Explanation:
def.
F (x) = P (Z1 + Z2 < x)
Z x Z x−z2
= fλ1 (z1 ) dz1 · fλ2 (z2 ) dz2
0 0
Z x Z x−z1 Z x Z x−z1
= fλ2 (z2 ) dz2 · fλ1 (z1 ) dz1 = fλ1 (z1 ) · fλ2 (z2 ) dz2 dz1 .
0 0 0 0
133.

Exponential probability density function Exponential cumulative distribution function

1.0
1.5

λ = 0.5
λ=1

0.8
λ = 1.5
1.0

0.6
P(X ≤ x)
p(x)

0.4
0.5

λ = 0.5

0.2
λ=1
λ = 1.5

0.0
0.0

0 1 2 3 4 5 0 1 2 3 4 5

x x
134.
Answer
The c.d.f. of X:
Z x Z x−z1 Z x Z x−z1
def.
F (x) = P (Z1 + Z2 < x) = fλ1 (z1 ) · fλ2 (z2 ) dz2 dz1 = λ1 e−λ1 z1 · λ2 e−λ2 z2 dz2 dz1
0 0 0 0
Z x Z x−z1 Z x

−λ2 z2 x−z1
= (−λ1 )e−λ1 z1 (−λ2 )e −λ2 z2
dz2 dz1 = (−λ1 )e −λ1 z1
e 0
dz1
0 0 0
Z !
x
= (−λ1 )e−λ1 z1 e−λ2 (x−z1 ) − |e−λ 2 ·0
{z } dz1
0 1
Z x Z x
= (−λ1 )e−λ1 z1 e−λ2 (x−z1 ) dz1 − (−λ1 )e−λ1 z1 dz1
0 0
Z x Z x
−λ1 −λ2 x (λ2 −λ1 )z1
= e (λ2 − λ1 )e dz1 − (−λ1 )e−λ1 z1 dz1
λ2 − λ1 0 0
λ1
x x
= − e−λ2 x e(λ2 −λ1 )z1 − e−λ1 z1 0
λ2 − λ1 0
λ1 λ1 λ1
−λ2 x (λ2 −λ1 )x
= − e e − 1 − e−λ1 x − 1 = − e−λ1 x + e−λ2 x − e−λ1 x + 1
λ2 − λ1 λ2 − λ1 λ2 − λ1
1 −λ1 x −λ2 x −λ1 x −λ1 x
λ2 e−λ1 x − λ1 e−λ2 x
= 1− λ1 e − λ1 e + λ2 e − λ1 e = 1−
λ2 − λ1 λ2 − λ1
135.

The p.d.f. of X:

∂F (x) 1 λ1 λ2
p(x) = =− −λ1 λ2 e−λ1 x + λ1 λ2 e−λ2 x = e−λ1 x − e−λ2 x
∂x λ2 − λ1 λ2 − λ1
136.
b. Derive the E-step and M-step of the EM algorithm, and give explicit
expressions for the parameter updates in the EM process for computing the
MLE of λ1 and λ2 .

Pasul E (Expectation): Se calculează funcţia ,,auxiliară“ (media log-

verosimilităţii datelor complete) pentru iteraţia t:
Q(λ | λ(t) ) = EP (Z|X,λ(t) ) [log P (X, Z | λ)]
Pasul M (Maximization): Se maximizează media log-verosimilităţii datelor
complete, calculată la pasul E, ı̂n raport cu λ:

λ(t+1) = argmax Q(λ | λ(t) )

λ
Notaţiile de mai sus au următoarele semnificaţii:

λ = (λ1 , λ2 )
λ(t) = valoarea parametrului λ la iteraţia t
X = variabila observabilă, cu instanţele x1 , x2 , . . . , xn
Z = (Z1 , Z2 ) variabilele ascunse/neobservabile
având valorile z1j , z2j , . . . , znj cu j ∈ {1, 2},
aşa ı̂ncât xi = zi1 + zi2 .
137.

Solution: E step; computation of the “auxiliary” function

i=1
" n #
X
= EP (Z|X,λ(t) ) (log λ1 − λ1 zi1 + log λ2 − λ2 zi2 )
i=1
" n
#
X
= EP (Z|X,λ(t) ) n log λ1 + n log λ2 − (λ1 zi1 + λ2 zi2 )
i=1
n
X n
X
= n log λ1 + n log λ2 − λ1 Ep(zi1 |xi ,λ(t) ) [zi1 ] − λ2 Ep(zi2 |xi ,λ(t) ) [zi2 ]
i=1 i=1
138.
Solution: E step; computation of expectations
Z xi
def.
Ep(zi1 |xi ,λ(t) ) [zi1 ] = zi1 · p(zi1 | xi , λ(t) ) dzi1
0

(t) def. p(zi1 , xi | λ(t) )

p(zi1 | xi , λ ) =
p(xi | λ(t) )

p(zi1 , zi2 |λ(t) )

(t)
p(zi1 |λ1 ) · p(zi2 |λ2 ) a.
(t) fλ(t) (zi1 ) · fλ(t) (xi − zi1 )
1 2
= = =
p(xi | λ(t) ) p(xi | λ(t) ) (t) (t)
λ1 λ2 (t) (t)

(t) (t)
e−λ1 xi − e−λ2 xi
λ2 − λ1
(t) (t) (t) (t)
(t) (t) λ1 e−λ1 zi1
· λ2 e−λ2 (xi −zi1 )
= (λ2 − λ1 ) · (t) (t)

(t) (t)
λ1 λ2 e−λ1 xi − e−λ2 xi
(t) (t)
(t) (t) e(λ2 −λ1 )zi1
= (λ2 − λ1 ) · (t) (t)
(t)
e−λ1 xi − e−λ2 xi · eλ2 xi

(t) (t)
(t) (t) e(λ2 −λ1 )zi1
= (λ2 − λ1 ) · (t) (t)
e(λ2 −λ1 )xi −1
139.
Z Z (t) (t)
xi xi (λ2 −λ1 )zi1
def. (t) (t) (t) e
⇒ Ep(zi1 |xi ,λ(t) ) [zi1 ] = zi1 · p(zi1 | xi , λ ) dzi1 = zi1 · (λ2 − λ1 ) · (t) (t)
dzi1
0 0 e(λ2 −λ1 )xi −1
Z xi
1 (t) (t) (t) (t)
= (t) (t)
zi1 · (λ2 − λ1 ) · e(λ2 −λ1 )zi1
dzi1
e(λ2 −λ1 )xi −1 0

Vom rezolva ultima integrală de mai sus utilizând formula de integrare prin părţi.
Z xi Z xi
(t) (t) (t) (t)
(λ2 −λ1 )zi1 ∂ (λ2(t) −λ1(t) )zi1
zi1 · (λ2 − λ1 ) · e dzi1 = zi1 · e dzi1
0 0 ∂z i1
xi Z xi
(t) (t) (t) (t)
(λ2 −λ1 )zi1
= zi1 · e − e(λ2 −λ1 )zi1 dzi1
0 0
(t) (t)
xi (λ2 −λ1 )xi
(t) (t)
(λ2 −λ1 )xi
1 (t) (t)
(λ2 −λ1 )zi1
(t) (t) e −1
= xi · e(λ2 −λ1 )xi −

= xi · e −0 − (t) (t)
·e (t) (t)
λ2 − λ1 0 λ2 − λ1

Prin urmare, !
(t) (t)
(λ2 −λ1 )xi
1 (t) (t) e −1
Ep(zi1 |xi ,λ(t) ) [zi1 ] = (t) (t)
· xi · e(λ2 −λ1 )xi
− (t) (t)
e(λ2 −λ1 )xi −1 λ2 − λ1
(t) (t)
xi · e(λ2 −λ1 )xi
1
= (t) (t)
− (t) (t)
e (λ2 −λ1 )xi −1 λ2 − λ1
140.

xi = zi1 + zi2
⇒ Ep(zi2 |xi ,λ(t) ) [zi2 ] = Ep(zi2 |xi ,λ(t) ) [xi − zi1 ] = xi − Ep(zi2 |xi,λ(t) ) [zi1 ]
= xi − Ep(z|xi,λ(t) ) [zi1 ] = xi − Ep(zi1 |xi,λ(t) ) [zi1 ]
(t) (t)
(λ2 −λ1 )xi
xi · e 1
= xi − (t) (t)
+ (t) (t)
e(λ2 −λ1 )xi −1 λ2 − λ1
1 xi
= (t) (t)
− (t) (t)
λ2 − λ1 e(λ2 −λ1 )xi −1
141.
Solution: M step

λ(t+1) = argmax Q(λ | λ(t) )

λ
n n
!
X X
= argmax n log λ1 + n log λ2 − λ1 Ep(zi1 |xi ,λ(t) ) [zi1 ]− −λ2 Ep(zi2 |xi ,λ(t) ) [zi2 ]
λ1 >0,λ2 >0 i=1 i=1

Observaţie:
Se verifică imediat că aceste soluţii ı̂ntr-adevăr maximizează Q(λ | λ(t) ).
142.

Using the EM algorithm for

learning a Poisson distribution
when some data are missing

Liviu Ciortuz, following

Anirban DasGupta, Probability for Statistics and Machine
Learning, Springer, 2011, Ex. 20.8
143.
Presupunem că vrem să modelăm statistic numărul de accidente uşoare care s-au produs
ı̂n n locaţii ı̂ntr-un anumit interval de timp, să zicem o săptămână. În acest scop, vom
folosi o distribuţie Poisson de parametru λ. La sârşitul perioadei de timp respective, ni
se transmite de la m din cele n locaţii câte o ,,ı̂nregistrare“ (notată cu xi ), reprezentând
numărul de accidente uşoare produse ı̂n locaţia i.
În mod implicit, ar trebui să considerăm că ı̂n cele n − m locaţii de la care n-am primit
ı̂nregistrări nu s-a produs niciun accident. Însă, ı̂n urma ,,inspectării“ datelor suntem
determinaţi să luăm ı̂n considerare presupunerea că ı̂n unele din aceste k locaţii s-a
produs câte un [singur] accident uşor, care a fost ,,trecut sub tăcere“ la raportare.
Aşadar, formalizând, vom considera datele ,,neobservabile“ z1 = 0, . . . , zn0 = 0, zn0 +1 =
1, . . . , zn0 +n1 = 1, cu n0 + n1 = n − m, alături de datele observabile x1 , . . . , xm ∈ N. Toate
aceste date sunt produse de variabile aleatoare urmând distribuţia Poisson de [acelaşi]
parametru λ.
Elaboraţi algoritmul EM (ı̂n speţă pasul E şi pasul M) pentru estimarea parametrului
λ.
Sugestie: În loc să se lucreze cu zj , cu j = 1, . . . , m, va fi suficient să consideraţi ca date
neobservabile n0 şi n1 . Mai mult, considerând numerele n şi m cunoscute, va fi suficient
să lucraţi doar cu n1 ca dată neobservabilă (bineı̂nţeles, pe lângă datele observabile xi ).
144.
Answer
z1 , . . . , zn+0 , zn0 +1 , . . . , zn0 +n1 , x1 , . . . , xm ∼ Poisson(λ)
1 λx
Poisson p.m.f.: P (x|λ) = λ ·
e x!
The verosimility of complete data:
def.
L(λ) = P (z1 , . . . , zn+0 , zn0 +1 , . . . , zn0 +n1 , x1 , . . . , xm |λ)
+n1
n0Y m n0 n1 m
i.i.d.
Y Y 1 λ0 Y 1 λ Y 1 λxi
= P (zi |λ) · P (xi |λ) = λ 0!
· λ 1!
·
j=1 i=1 j=1
e j=n +1
e i=1
eλ xi !
0
m m
1 n1
Y λxi 1 n1
Y λxi
= ·λ · = λ n ·λ ·
(eλ )n0 +n1 +m i=1
xi ! (e ) x!
i=1 i
Pm 1 Pm 1
= e−nλ · λn1 · λ i=1 xi
· Qm = e−nλ · λn1 + i=1 xi
· Qm
i=1 xi ! i=1 xi !

The log-verosimility of complete data:

m
! m
def. X X
ℓ(λ) = ln L(λ) = −nλ + n1 + xi ln λ − ln xi !
i=1 i=1
145.

The auxiliary function:

m
! m
def. X X
(t) (t)
Q(λ|λ ) = En1 |xi ,λ(t) [ℓ(λ)] = −nλ + E[n1 |xi , λ ] + xi ln λ − ln xi !.
i=1 i=1

E-step:
En1 |xi ,λ(t) [ℓ(λ)] is the expected number of unobservable instances zj having value 1, out
of the total of n − m unobservable instances.
The definition of the density function for the Poisson distribution implies
1 λ1 1 1 λ0 1
P (x = 1|λ) = λ · = λ · λ and P (x = 0|λ) = λ · = λ.
e 1! e e 0! e
Therefore E[n1 |xi , λ(t) ] is exactly the expectation of the binomial distribution

λ(t)
Bin n − m; . So,
1 + λ(t)

(t) λ(t)
E[n1 |xi , λ ] = (n − m) .
1 + λ(t)
146.

M-step:
Now we can write
" m
# m
(t)
λ X X
Q(λ|λ(t) ) = −nλ + (n − m) + xi ln λ − ln xi !.
1 + λ(t) i=1 i=1

The first two derivatives of this function w.r.t. λ are:

" m
#
(t)
∂ λ X 1
Q(λ|λ(t) ) = −n + (n − m) + xi
∂λ 1 + λ(t) i=1 λ
" m
#
2 (t)
∂ (t) λ X 1
Q(λ|λ ) = − (n − m) + xi
∂λ2 1 + λ(t) i=1 λ2
147.

Pm
From the problem’s statement we know that n > m and i=1 xi ≥ 0.
At the initialization step of the EM algorithm the parameter λ is asigned a
positive value (λ(0) ), and we will soon see that λ(t) > 0 at each iteration t > 0.
Therefore, the second derivative of Q is always negative, meaning that Q is
concave and therefore it has a maximum.
In order to find this maximum we equate the first derivative to 0, which gives
us the update equation:
" m
#
(t)
1 λ X
λ(t+1) = (n − m) + xi
n 1 + λ(t) i=1

One can easily see now (ny induction) that λ(t+1) > 0.

S-ar putea să vă placă și

Curs 12 REI
Încă nu există evaluări
Curs 12 REI
5 pagini
Metode de Estimare A Parametrilor
Încă nu există evaluări
Metode de Estimare A Parametrilor
4 pagini
Curs5 Statistica
Încă nu există evaluări
Curs5 Statistica
33 pagini
Metode de Estimare A Parametrilor PDF
Încă nu există evaluări
Metode de Estimare A Parametrilor PDF
4 pagini
Estimatiesl Article
Încă nu există evaluări
Estimatiesl Article
13 pagini
Metoda Momentelor Si Verosimilitatii Maxime
Încă nu există evaluări
Metoda Momentelor Si Verosimilitatii Maxime
9 pagini
Metode de Estimare A Parametrilor
Încă nu există evaluări
Metode de Estimare A Parametrilor
4 pagini
TEMA ML Seminar 10
Încă nu există evaluări
TEMA ML Seminar 10
3 pagini
Notiuni de Teoria Estimatiei
Încă nu există evaluări
Notiuni de Teoria Estimatiei
9 pagini
Partea3 Estimatorul de Verosimilitate Maxima
Încă nu există evaluări
Partea3 Estimatorul de Verosimilitate Maxima
46 pagini
Teoria Estimării MCQ
Încă nu există evaluări
Teoria Estimării MCQ
8 pagini
Partea1 Principiul MLE
Încă nu există evaluări
Partea1 Principiul MLE
31 pagini
Pregatire Ex Final Etti 2022
Încă nu există evaluări
Pregatire Ex Final Etti 2022
17 pagini
Proiect Is
Încă nu există evaluări
Proiect Is
22 pagini
Curs12 Statistica2
Încă nu există evaluări
Curs12 Statistica2
15 pagini
Metode Numerice in Calculul Variatilor
Încă nu există evaluări
Metode Numerice in Calculul Variatilor
78 pagini
Calcul Stochastic - Curs Introductiv III - Slide
Încă nu există evaluări
Calcul Stochastic - Curs Introductiv III - Slide
26 pagini
C13 - Estimarea Parametrilor Semnalelor
Încă nu există evaluări
C13 - Estimarea Parametrilor Semnalelor
7 pagini
Model 2 - Subiecte Examen
Încă nu există evaluări
Model 2 - Subiecte Examen
3 pagini
Estimații
Încă nu există evaluări
Estimații
13 pagini
Noţiuni de Teoria Credibilităţii - Modelul de Credibilitate Bűhlmann
Încă nu există evaluări
Noţiuni de Teoria Credibilităţii - Modelul de Credibilitate Bűhlmann
10 pagini
Estimarea Punctuala A Parametrilor
Încă nu există evaluări
Estimarea Punctuala A Parametrilor
4 pagini
Carte Ingineria Sistemelor
Încă nu există evaluări
Carte Ingineria Sistemelor
115 pagini
Identificarea Sistemelor
100% (1)
Identificarea Sistemelor
115 pagini
TIA Curs3
Încă nu există evaluări
TIA Curs3
15 pagini
Ipoteze Statistice
Încă nu există evaluări
Ipoteze Statistice
10 pagini
Cap 3 (1) - MODELUL LINIAR MULTIPLU
100% (1)
Cap 3 (1) - MODELUL LINIAR MULTIPLU
18 pagini
Estimiatori
Încă nu există evaluări
Estimiatori
30 pagini
Prelucrarea Statistic A A Semnalelor
Încă nu există evaluări
Prelucrarea Statistic A A Semnalelor
179 pagini
Remedial BS Explicații
Încă nu există evaluări
Remedial BS Explicații
55 pagini
Seminar TP
Încă nu există evaluări
Seminar TP
217 pagini
Examen TPSM - 16 - Iunie - 2023 - Ora14
Încă nu există evaluări
Examen TPSM - 16 - Iunie - 2023 - Ora14
3 pagini
Curs 6 - TS PDF
Încă nu există evaluări
Curs 6 - TS PDF
23 pagini
Curs 14 Probabilitati Si Statistica Matematica
Încă nu există evaluări
Curs 14 Probabilitati Si Statistica Matematica
3 pagini
Curs - Statistica Matematica - UTCB
Încă nu există evaluări
Curs - Statistica Matematica - UTCB
132 pagini
Cursuri MMAE
Încă nu există evaluări
Cursuri MMAE
208 pagini
Variante Rezolvari
Încă nu există evaluări
Variante Rezolvari
17 pagini
Monica Dumitrescu - Statistica Matematica (Note de Curs)
Încă nu există evaluări
Monica Dumitrescu - Statistica Matematica (Note de Curs)
113 pagini
Remedial Econometrie FINAL-2
0% (4)
Remedial Econometrie FINAL-2
38 pagini
Aplicatii Grila Rezolvari 1
Încă nu există evaluări
Aplicatii Grila Rezolvari 1
12 pagini
Curs 12 Probabilitati - Estimare Prin Intervale de Incredere
Încă nu există evaluări
Curs 12 Probabilitati - Estimare Prin Intervale de Incredere
11 pagini
Mract Curs 2 Estimatii Punctuale Si Intervale
Încă nu există evaluări
Mract Curs 2 Estimatii Punctuale Si Intervale
17 pagini
Lucrare Pascu
Încă nu există evaluări
Lucrare Pascu
103 pagini
MS Curs 8
Încă nu există evaluări
MS Curs 8
10 pagini
Inf RB
Încă nu există evaluări
Inf RB
47 pagini
Ditributii
Încă nu există evaluări
Ditributii
5 pagini
Examen TPSM - Calc - 29!06!23
Încă nu există evaluări
Examen TPSM - Calc - 29!06!23
4 pagini
MN
Încă nu există evaluări
MN
22 pagini
03 Teorema Girsanov
100% (1)
03 Teorema Girsanov
11 pagini
Modelare Si Simulare 3
Încă nu există evaluări
Modelare Si Simulare 3
102 pagini
Fisa Disciplinei FMRF Dependenta
Încă nu există evaluări
Fisa Disciplinei FMRF Dependenta
3 pagini
Raspuns Reclamatie Client
Încă nu există evaluări
Raspuns Reclamatie Client
1 pagină
Exercitii de Invatare Automata
Încă nu există evaluări
Exercitii de Invatare Automata
38 pagini
Arduino 1
100% (1)
Arduino 1
94 pagini
Octavian Mihail Sachelarie Art9
Încă nu există evaluări
Octavian Mihail Sachelarie Art9
10 pagini
Exercitii de Invatare Automata
Încă nu există evaluări
Exercitii de Invatare Automata
10 pagini
Curs Rogers-Ivey DS
Încă nu există evaluări
Curs Rogers-Ivey DS
9 pagini
Criza Familială Generata de Migratie
Încă nu există evaluări
Criza Familială Generata de Migratie
5 pagini
Obedienţa Rezumat
Încă nu există evaluări
Obedienţa Rezumat
2 pagini
Documentatie Aplicatie Server-Client
Încă nu există evaluări
Documentatie Aplicatie Server-Client
3 pagini

Expectation Maximization Algorithm

Încărcat de

Expectation Maximization Algorithm

Încărcat de

0.

Algoritmul EM, fundamentare teoretică:

Algoritmul EM (Expectation-Maximization) permite crearea unor

În continuare vom nota cu q o funcţie / distribuţie de probabilitate definită

constituie o margine inferioară pentru funcţia de log-verosimilitate a

b. Vă reamintim definiţia entropiei relative (numită şi divergenţa

Pasul E: q (t) (z) = argmax F (q(z), θ(t) )

Pasul M: θ(t+1) = argmax F (q (t) (z), θ)

⇒ log P (x | θ) = F (q(z), θ) + KL(q(z) || P (z | x, θ)).

c. Fie θ(t) valoarea obţinută pentru parametrul / parametrii θ la iteraţia t

log P (x | θ(t) ) ≥ F (q(z), θ(t) )

În fine, dacă alegem q(z) = P (z | x, θ(t) ), atunci termenul KL(q(z)||P (z |

Acum vom calcula această valoare maximă:

= Q(θ(t) | θ(t) ) + H[P (z | x, θ(t) )]

Gt (θ) = Q(θ | θ(t) ) + H[P (z | x, θ(t) )]

Egalitatea precedentă este responsabilă pentru următoarea reformulare

Pasul E′ : calculează Q(θ | θ(t) ) = EP (z|x,θ(t) ) [log P (x, z | θ)]

prelucrare de Liviu Ciortuz, după

Pentru a maximiza funcţia de log-verosimilitate a datelor observabile, i.e.

reprezentând media log-verosimilităţii datelor complete (observabile şi

log P (x | θ) − log P (x | θ(t) ) ≥ Q(θ | θ(t) ) − Q(θ(t) | θ(t) ) (2)

log P (x | θ(t+1) ) ≥ log P (x | θ(t) ). log P(x| θ)

log P (x | θ) = Q(θ | θ(t) ) + CH(θ | θ(t) )

log P (x | θ(t) ) = Q(θ(t) | θ(t) ) + CH(θ(t) | θ(t) )

Demonstraţia relaţiei (2), cont.

log P (x | θ) − log P (x | θ(t) ) ≥ Q(θ | θ(t) ) − Q(θ(t) | θ(t) )

θ (t) θ (t+1) θ (t+2) θ

Using the EM algorithm for

Liviu Ciortuz, following

Location of the ABO gene on

parameters unobservable data observable data

Output EM algorithm Input

a. Considerăm că ı̂ntr-un eşantion de populaţie format din n per-

Pornind de la aceste date ,,observabile“, să se deriveze algoritmul

1. Presupunând că procesul de asociere a alelelor moştenite de

2. Considerând nA = nAA +nAO şi nB = nBB +nBO , unde semnificaţiile

3. La pasul E al algoritmului EM veţi calcula n̂AA şi n̂BB ,

Veţi scrie apoi funcţia de log-verosimilitate a datelor com-

4. La pasul M al algoritmului EM, pornind de la media funcţiei

The likelihood of complete data:

where c and c′ are constants which do not depend on the p parameter.

= argmax [ln c′ + 2 n̂AA ln pA + (nA − n̂AA )(ln pA + ln pO ) + 2 n̂BB ln pB +

b. Implementaţi algoritmul EM pe care l-aţi conceput la punctul

Ca valori iniţiale pentru parametri, veţi lucra mai ı̂ntâi cu pA =

Starting values: Starting values:

Using the EM algorithm for

Liviu Ciortuz, following

offspring diploid cell

homologous chromosomes homologous chromosomes

haploid cell haploid cell

Fie o variabilă aleatoare X care iavalori ı̂n mulţimea {v1 , v2 , v3 , v4 }

a. Presupunem că n1 , n2 , n3 şi n4 sunt numărul de ,,realizări“

where c is constant w.r.t. ψ.

∂ n1 n2 + n3 n4 n1 · ψ(1 − ψ) − (n2 + n3 ) · ψ(2 + ψ) + n4 · (2 + ψ)(1 − ψ)

Note that the denominator (2 + ψ)(1 − ψ)ψ is always positive.

while the log-verosimility function can be written as

ln P (D|ψ) = ln d + (n2 + n3 ) ln(1 − ψ) + (n12 + n4 ) ln(ψ),

where d is constant w.r.t. ψ.

⇒ the auxiliary function is concave.

The EM algorithm for solving

a. Show that E[zi |xi , θ] = P (zi = 1|xi , θ).

prelucrare de Liviu Ciortuz, după

Fie următorul experiment probabilist:

i. Calculaţi θ̂A şi θ̂B , probabilităţile de apariţie a feţei ‘head’/stemă pentru

aPrezentăm acest “artificiu” ca pregătire pentru rezolvarea (ulterioară a) punctului

Observaţie: Facem presupunerea că am reţinut succesiunea [tuturor] rezul-

1. Am arătat pe acest caz particular că metoda de calculare a proba-

B. La acest punct se va relua experimentul de la punctul A, ı̂nsă de data

iv. Pentru convenienţă, ı̂n locul variabilelor ,,neobservabile“ Zi pentru

i Zi,A Zi,B Xi E[Zi,A ] E[Zi,B ]

E[Zi,A | Xi , θ(0) ] = 0 · P (Zi,A = 0 | Xi , θ(0) ) + 1 · P (Zi,A = 1 | Xi , θ(0) )

Probabilităţile P (Zi,A = 1 | X, θ(0) ) = P (Zi,A = 1 | Xi , θ(0) ), pentru i = 1, . . . , 5,

(0) P (Xi | Zi,A = 1, θ(0) ) · P (Zi,A = 1 | θ(0) )

De exemplu, pentru i = 1 vom avea:

(0) 0.65 (1 − 0.6)5 1

Similar cu E[ZA,1 | X1 , θ(0) ] se calculează şi celelalte medii E[Zi,A | Xi , θ(0) ]

Fie o variabilă aleatoare X care iavalori ı̂n mulţimea {v1 , v2 , v3 , v4 }