The material in this post is from ESL (The Elements of Statistical Learning).
Two-Component Mixture Model
Suppose we have the following model:
$$
Y_1 \sim N(\mu_1, \sigma_1^2), \qquad
Y_2 \sim N(\mu_2, \sigma_2^2), \qquad
Y = (1 - \Delta)\,Y_1 + \Delta\,Y_2,
$$
where
$\Delta \in \{0, 1\}$ with $\Pr(\Delta = 1) = \pi$.
Therefore, the density function of $Y$ is
$$
f(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y),
$$
where $\phi_\theta(y)$ denotes the normal density with parameters $\theta = (\mu, \sigma^2)$.
Then the log-likelihood would be
$$
\log L(\theta; \mathbf{Z}) = \sum_{i=1}^N \log\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right].
$$
It would be hard to compute the MLE from this directly, because the sum sits inside the logarithm. But if we knew the $\Delta_i$, the (complete-data) log-likelihood would be
$$
\log L_0(\theta; \mathbf{Z}, \boldsymbol{\Delta}) = \sum_{i=1}^N \left[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right] + \sum_{i=1}^N \left[(1 - \Delta_i)\log(1 - \pi) + \Delta_i\log\pi\right],
$$
and the MLE would be much easier to compute.
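To make the contrast concrete, here is a minimal Python/NumPy sketch of the two log-likelihoods (the function names and the use of standard deviations rather than variances are my own illustration choices, not from ESL):

```python
import numpy as np
from scipy.stats import norm  # norm.pdf takes the standard deviation as its scale argument

def observed_loglik(y, pi, mu1, sd1, mu2, sd2):
    """log L(theta; Z): the sum sits inside the log, so there is no closed-form MLE."""
    dens = (1 - pi) * norm.pdf(y, mu1, sd1) + pi * norm.pdf(y, mu2, sd2)
    return np.sum(np.log(dens))

def complete_loglik(y, delta, pi, mu1, sd1, mu2, sd2):
    """log L_0(theta; Z, Delta): with Delta known, the terms separate by component."""
    return np.sum((1 - delta) * (norm.logpdf(y, mu1, sd1) + np.log(1 - pi))
                  + delta * (norm.logpdf(y, mu2, sd2) + np.log(pi)))
```

With $\Delta$ known, each parameter appears in its own sum, so the maximizers are simply the within-group sample means, variances, and the proportion of $\Delta_i = 1$.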
However, since the $\Delta_i$ are unknown, we proceed in an iterative fashion, substituting for each $\Delta_i$ its expected value
$$
\gamma_i(\theta) = E(\Delta_i \mid \theta, y_i) = \Pr(\Delta_i = 1 \mid \theta, y_i).
$$
From Bayes' rule,
$$
\begin{aligned}
\Pr(\Delta_i = 1 \mid y_i, \theta)
&= \frac{\Pr(\Delta_i = 1, y_i \mid \theta)}{\Pr(y_i \mid \theta)} \\
&= \frac{\Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)}{\Pr(y_i \mid \Delta_i = 0, \theta)\,\Pr(\Delta_i = 0 \mid \theta) + \Pr(y_i \mid \Delta_i = 1, \theta)\,\Pr(\Delta_i = 1 \mid \theta)} \\
&= \frac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}.
\end{aligned}
$$
This quantity is also called the responsibility of model 2 for observation $i$. To estimate the parameters, we use a procedure called the EM algorithm:
- Take initial guesses for the parameters $\hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2, \hat\pi$.
- Expectation Step: compute the responsibilities
$$
\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, \dots, N.
$$
- Maximization Step: compute the weighted means, variances, and mixing probability (notice the soft assignment):
$$
\hat\mu_1 = \frac{\sum_{i=1}^N (1 - \hat\gamma_i)\,y_i}{\sum_{i=1}^N (1 - \hat\gamma_i)}, \qquad
\hat\sigma_1^2 = \frac{\sum_{i=1}^N (1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^N (1 - \hat\gamma_i)},
$$
$$
\hat\mu_2 = \frac{\sum_{i=1}^N \hat\gamma_i\,y_i}{\sum_{i=1}^N \hat\gamma_i}, \qquad
\hat\sigma_2^2 = \frac{\sum_{i=1}^N \hat\gamma_i\,(y_i - \hat\mu_2)^2}{\sum_{i=1}^N \hat\gamma_i}, \qquad
\hat\pi = \frac{\sum_{i=1}^N \hat\gamma_i}{N}.
$$
- Iterate steps 2 and 3 until convergence. (A small code sketch of this procedure follows the list.)
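Here is a minimal Python/NumPy sketch of this procedure (the function name, the initialization scheme, and the stopping rule are illustration choices of mine, not prescribed by ESL):

```python
import numpy as np
from scipy.stats import norm

def em_two_component(y, n_iter=200, tol=1e-8, seed=None):
    """EM for a two-component Gaussian mixture on 1-D data y."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initial guesses for mu1, sd1, mu2, sd2, pi.
    mu1, mu2 = rng.choice(y, size=2, replace=False)
    sd1 = sd2 = np.std(y)
    pi = 0.5
    prev = -np.inf
    for _ in range(n_iter):
        # Step 2 (Expectation): responsibilities gamma_i.
        p1 = (1 - pi) * norm.pdf(y, mu1, sd1)
        p2 = pi * norm.pdf(y, mu2, sd2)
        gamma = p2 / (p1 + p2)
        # Step 3 (Maximization): weighted means, variances, and mixing proportion.
        w1, w2 = np.sum(1 - gamma), np.sum(gamma)
        mu1 = np.sum((1 - gamma) * y) / w1
        mu2 = np.sum(gamma * y) / w2
        sd1 = np.sqrt(np.sum((1 - gamma) * (y - mu1) ** 2) / w1)
        sd2 = np.sqrt(np.sum(gamma * (y - mu2) ** 2) / w2)
        pi = w2 / len(y)
        # Stop when the observed log-likelihood stops improving.
        loglik = np.sum(np.log(p1 + p2))
        if loglik - prev < tol:
            break
        prev = loglik
    return {"pi": pi, "mu1": mu1, "sd1": sd1, "mu2": mu2, "sd2": sd2}
```

For example, on data simulated from an equal mixture of $N(0, 1)$ and $N(4, 1)$:

```python
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
print(em_two_component(y, seed=1))
```

Notice how the soft assignment enters: every observation contributes to both components, weighted by $\hat\gamma_i$ and $1 - \hat\gamma_i$.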
The EM Algorithm in General
The EM algorithm is designed for problems like the one above, where the likelihood is difficult to maximize directly but becomes easier when the sample is enlarged with latent (unobserved) data. This is called data augmentation.
For the general formulation of the EM algorithm, let $\mathbf{Z}$ denote the observed data with log-likelihood $\log L(\theta; \mathbf{Z})$, and let $\mathbf{Z}^m$ denote the latent (missing) data. Let $\mathbf{T} = (\mathbf{Z}, \mathbf{Z}^m)$ be the complete data. For the problem above, $(\mathbf{Z}, \mathbf{Z}^m) = (\mathbf{y}, \boldsymbol{\Delta})$.
The EM algorithm would be:
- Initialize $\hat\theta^{(0)}$.
- Expectation Step: at the $j$-th step, compute
$$
Q(\theta', \hat\theta^{(j)}) = E\left(\log L_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat\theta^{(j)}\right)
$$
as a function of the dummy argument $\theta'$.
- Maximization Step: determine the new estimate $\hat\theta^{(j+1)}$ as the maximizer of $Q(\theta', \hat\theta^{(j)})$ over $\theta'$, that is,
$$
\hat\theta^{(j+1)} = \arg\max_{\theta'} Q(\theta', \hat\theta^{(j)}).
$$
- Iterate steps 2 and 3 until convergence. (A generic skeleton of this loop is sketched after the list.)
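To emphasize the structure, here is a tiny generic skeleton of this loop (the callables `e_step` and `m_step` are hypothetical placeholders of mine, not an API from ESL); the mixture code above is an instance of it, with the responsibilities playing the role of the expected statistics:

```python
def em(theta0, e_step, m_step, n_iter=100):
    """Generic EM loop: alternate the expectation and maximization steps.

    e_step(theta) should return whatever is needed to form Q(theta', theta)
    (for the mixture model, the responsibilities gamma_i);
    m_step(stats) should return arg max over theta' of Q(theta', theta).
    """
    theta = theta0
    for _ in range(n_iter):
        stats = e_step(theta)   # Expectation step
        theta = m_step(stats)   # Maximization step
    return theta
```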
There are two major questions to address. First, why is this algorithm correct? Second, how is this algorithm, with the weird-looking $Q$, the same as the algorithm we used in the previous section?
From Bayes' rule, we have
$$
\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta') = \frac{\Pr(\mathbf{Z}^m, \mathbf{Z} \mid \theta')}{\Pr(\mathbf{Z} \mid \theta')}.
$$
So we have
$$
\Pr(\mathbf{Z} \mid \theta') = \frac{\Pr(\mathbf{T} \mid \theta')}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta')}.
$$
In terms of log-likelihoods:
$$
\log L(\theta'; \mathbf{Z}) = \log L_0(\theta'; \mathbf{T}) - \log L_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z}).
$$
Notice that the two terms on the right are functions of $\mathbf{Z}^m$, while the left-hand side is not. Taking the conditional expectation given the observed data $\mathbf{Z}$ under a parameter value $\theta$,
$$
\log L(\theta'; \mathbf{Z}) = E\left[\log L_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \theta\right] - E\left[\log L_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z}) \mid \mathbf{Z}, \theta\right] \equiv Q(\theta', \theta) - R(\theta', \theta).
$$
For the above problem,
$$
\log L(\theta'; \mathbf{Z}) = \sum_{i=1}^N \log\left[(1 - \pi')\,\phi_{\theta_1'}(y_i) + \pi'\,\phi_{\theta_2'}(y_i)\right].
$$
And
$$
\log L_0(\theta'; \mathbf{Z}, \boldsymbol{\Delta}) = \sum_{i=1}^N \left[(1 - \Delta_i)\log\phi_{\theta_1'}(y_i) + \Delta_i\log\phi_{\theta_2'}(y_i)\right] + \sum_{i=1}^N \left[(1 - \Delta_i)\log(1 - \pi') + \Delta_i\log\pi'\right].
$$
And
$$
\begin{aligned}
Q(\theta', \theta) &= E\left[\log L_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \theta\right] \\
&= \sum_{i=1}^N \left[(1 - E[\Delta_i \mid \mathbf{Z}, \theta])\log\phi_{\theta_1'}(y_i) + E[\Delta_i \mid \mathbf{Z}, \theta]\log\phi_{\theta_2'}(y_i)\right] \\
&\quad + \sum_{i=1}^N \left[(1 - E[\Delta_i \mid \mathbf{Z}, \theta])\log(1 - \pi') + E[\Delta_i \mid \mathbf{Z}, \theta]\log\pi'\right] \\
&= \sum_{i=1}^N \left[(1 - \hat\gamma_i)\log\phi_{\theta_1'}(y_i) + \hat\gamma_i\log\phi_{\theta_2'}(y_i)\right] + \sum_{i=1}^N \left[(1 - \hat\gamma_i)\log(1 - \pi') + \hat\gamma_i\log\pi'\right].
\end{aligned}
$$
Therefore, for the problem above, the expectation step reduces to computing the responsibilities $\hat\gamma_i$, where the conditioning parameter $\theta$ is the estimate from the previous iteration, $\hat\theta^{(j)}$; maximizing $Q(\theta', \hat\theta^{(j)})$ over $\theta'$ then gives exactly the weighted updates of the previous section.
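As a numerical sanity check of the decomposition $\log L(\theta'; \mathbf{Z}) = Q(\theta', \theta) - R(\theta', \theta)$ for this mixture, note that $\log L_1(\theta'; \boldsymbol{\Delta} \mid \mathbf{Z})$ is just the Bernoulli log-probability of the $\Delta_i$ with success probabilities $\gamma_i(\theta')$, so $R(\theta', \theta) = \sum_{i=1}^N \left[(1 - \gamma_i(\theta))\log(1 - \gamma_i(\theta')) + \gamma_i(\theta)\log\gamma_i(\theta')\right]$. A minimal Python/NumPy sketch (the helper names and the simulated data are illustration choices of mine):

```python
import numpy as np
from scipy.stats import norm

def gamma(y, th):
    """Responsibilities Pr(Delta_i = 1 | y_i, theta), theta = (pi, mu1, sd1, mu2, sd2)."""
    pi, mu1, sd1, mu2, sd2 = th
    p1 = (1 - pi) * norm.pdf(y, mu1, sd1)
    p2 = pi * norm.pdf(y, mu2, sd2)
    return p2 / (p1 + p2)

def loglik(y, th):
    """Observed-data log-likelihood log L(theta; Z)."""
    pi, mu1, sd1, mu2, sd2 = th
    return np.sum(np.log((1 - pi) * norm.pdf(y, mu1, sd1) + pi * norm.pdf(y, mu2, sd2)))

def Q(y, th_new, th):
    """E[log L_0(theta'; T) | Z, theta], with E[Delta_i | Z, theta] = gamma_i(theta)."""
    pi, mu1, sd1, mu2, sd2 = th_new
    g = gamma(y, th)
    return np.sum((1 - g) * (norm.logpdf(y, mu1, sd1) + np.log(1 - pi))
                  + g * (norm.logpdf(y, mu2, sd2) + np.log(pi)))

def R(y, th_new, th):
    """E[log L_1(theta'; Delta | Z) | Z, theta]: Bernoulli log-probability averaged under theta."""
    g, g_new = gamma(y, th), gamma(y, th_new)
    return np.sum((1 - g) * np.log(1 - g_new) + g * np.log(g_new))

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])
theta = (0.5, 0.0, 1.0, 4.0, 1.0)        # "current" parameters theta
theta_new = (0.3, 0.5, 1.5, 3.5, 0.8)    # an arbitrary dummy argument theta'
assert np.isclose(loglik(y, theta_new), Q(y, theta_new, theta) - R(y, theta_new, theta))
```

The identity holds for any pair of parameter values, which is exactly what the derivation above says.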
And the question now is why maximizing only $Q$ ends up maximizing $\log L$.