
About MLE

date: 2019-01-30
tags: machine-learning  math  

It is really common to use the maximum likelihood estimator (MLE) in machine learning. But have you ever thought about why? Apart from the fact that it is prevalent and simple, here is why the MLE is a very nice estimator.

Mean Square Error

Before discussing whether the MLE is a good estimator, we need a criterion. One of the most popular measures of an estimator is the mean squared error (MSE). Here is its definition:

$$MSE(\hat\theta)=E[||\hat\theta-\theta||_2^2]=tr\{E[(\hat\theta-\theta)(\hat\theta-\theta)^T]\}$$

where $\theta$ is the parameter of the statistical model. From a short derivation, we get a useful decomposition (the bias-variance decomposition):

$$MSE(\hat\theta)=tr[var(\hat\theta)]+||bias(\hat\theta)||_2^2$$

where

$$bias(\hat\theta)=E[\hat\theta]-\theta$$

If the bias is 0, the estimator is called unbiased.
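
To make the decomposition concrete, here is a minimal numerical check (a sketch; the Gaussian model and the shrinkage estimator below are just illustrative assumptions):

```python
import numpy as np

# Empirically verify MSE = tr[var] + ||bias||^2 for a deliberately biased
# estimator of the mean of N(theta, 1). Everything here is a toy assumption.
rng = np.random.default_rng(0)
theta = 2.0
n, trials = 50, 100_000

samples = rng.normal(theta, 1.0, size=(trials, n))
theta_hat = 0.9 * samples.mean(axis=1)       # shrinkage estimator: biased on purpose

mse = np.mean((theta_hat - theta) ** 2)
var = np.var(theta_hat)
bias_sq = (np.mean(theta_hat) - theta) ** 2

print(mse, var + bias_sq)                    # the two numbers should almost coincide
```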

Maximum Likelihood Estimator

There is also a formal definition for the MLE. First, the likelihood function is:

$$L(\theta; x_1, ..., x_n)=\prod_{i=1}^n f(x_i;\theta)$$

And the MLE of $\theta$ is

$$\hat\theta^{MLE}=\arg\max_{\theta\in\Theta}L=\arg\max_{\theta\in\Theta}\log L$$
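
As a small illustration (a sketch; the exponential model with rate $\theta$ is an assumption I picked because its MLE has the closed form $1/\bar{x}$, which lets us sanity-check the optimizer), the MLE can be found by numerically maximizing $\log L$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy example: x_i ~ Exponential(rate=theta), density theta * exp(-theta * x).
rng = np.random.default_rng(0)
theta_true = 1.5
x = rng.exponential(scale=1 / theta_true, size=1000)

def neg_log_likelihood(theta):
    # -log L(theta; x_1, ..., x_n) = -(n log theta - theta * sum(x_i))
    return -(len(x) * np.log(theta) - theta * x.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded")
print(result.x, 1 / x.mean())   # numerical MLE vs the closed-form MLE 1 / sample mean
```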

For the MLE, there is an important (and hard to prove) theorem.

Theorem

If $L$ is smooth and behaves in a nice way (I omit the strict regularity conditions here), then:

  • $\hat\theta^{MLE}\xrightarrow{\ P\ }\theta$
  • $\sqrt{n}(\hat\theta^{MLE}-\theta)\xrightarrow{\ D\ }N(0, I(\theta)^{-1})$ as $n\rightarrow\infty$

where $I(\theta)$ is the Fisher information matrix.

In a word, the MLE is consistent and asymptotically normal. For the proof, please visit here: proof.
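
A quick simulation makes the asymptotic normality visible (a sketch under an assumed Bernoulli($p$) model, where the MLE is the sample mean and $I(p)=1/(p(1-p))$):

```python
import numpy as np

# Check that sqrt(n) * (p_hat - p) has mean ~0 and variance ~ I(p)^{-1} = p(1-p).
rng = np.random.default_rng(0)
p, n, trials = 0.3, 2000, 50_000

x = rng.binomial(1, p, size=(trials, n))
p_hat = x.mean(axis=1)                 # the MLE of p for each simulated sample
z = np.sqrt(n) * (p_hat - p)

print(z.mean(), z.var(), p * (1 - p))  # z.var() should be close to p * (1 - p)
```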

There is another theorem that gives a lower bound on the variance of an unbiased estimator.

Theorem (Cramér-Rao lower bound)

Let $X_1,...,X_n$ be an i.i.d. sample of random variables with density or frequency function $f(x; \theta)$, and assume:

  • The support of $f(x; \theta)$ (the set where $f$ is not 0) does not depend on $\theta$.
  • $f(x; \theta)$ is differentiable with respect to $\theta$ for all $x$.

Then for any unbiased estimator $T(X)$ of $\theta$, we have

$$var(T(X))\geq \frac{1}{nI(\theta)}$$
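
As a concrete check (a worked example I add here, assuming the model $N(\mu, \sigma^2)$ with known $\sigma$), the sample mean, which is also the MLE of $\mu$, attains this bound exactly:

$$\log f(x;\mu)=-\frac{(x-\mu)^2}{2\sigma^2}+\text{const},\qquad I(\mu)=E\left[\left(\frac{\partial}{\partial\mu}\log f(X;\mu)\right)^2\right]=E\left[\frac{(X-\mu)^2}{\sigma^4}\right]=\frac{1}{\sigma^2}$$

$$var(\bar{X})=\frac{\sigma^2}{n}=\frac{1}{nI(\mu)}$$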

As $n\rightarrow\infty$, the MLE becomes unbiased (because it is consistent), and its asymptotic variance $I(\theta)^{-1}/n$ attains this lower bound, so it is the asymptotically optimal estimator. In practice, we can trust that the MLE is very good when the sample size is large enough.