About Binary Classifiers

date: 2019-02-25
tags: machine-learning  

This is a summary of several rudimentary binary classifiers. More complex ones such as SVM will be covered in a future post.

Bayes Classifier

For any classifier $f: \mathcal{X}\rightarrow \mathcal{Y}$, its prediction error is:

$$P(f(X) \neq Y)=\mathbb{E}[1\!\!1(f(X) \neq Y)] =\mathbb{E}[\mathbb{E}[1\!\!1(f(X) \neq Y)|X]]$$

And for each $x\in \mathcal{X}$,

$$\mathbb{E}[1\!\!1(f(X)\neq Y)|X=x] =\sum_{y\in \mathcal{Y}}P(Y=y|X=x)1\!\!1(f(x)\neq y)$$

For this fixed $x$, minimizing the above value is equivalent to maximizing the single term that is left out of the sum, which is

$$P(Y=y|X=x)1\!\!1(f(x)= y)$$

Therefore, the optimal classifier would be:

$$f(x)=\arg\max_{y\in\mathcal{Y}}P(Y=y|X=x)$$

The classifier with the above property for all $x\in\mathcal{X}$ is called the Bayes classifier. In other words, the Bayes classifier always predicts the $y$ with the highest posterior probability, and it has the smallest prediction error among all classifiers.

And from Bayes' rule, we have

$$f(x)=\arg\max_{y\in\mathcal{Y}}\frac{P(Y=y, X=x)}{P(X=x)} =\arg\max_{y\in\mathcal{Y}}P(Y=y)\times P(X=x|Y=y)$$

Notice that since $x$ is fixed, $P(X=x)$ is a constant. So we can interpret the new formula as the product of the class prior and the class-conditional likelihood of the data. In practice we know neither of them, so we approximate both.

Aside: if $X$ is a continuous variable, replace $P$ with the density $p$.

For example, we could make the following assumptions:

  • Class prior: $P(Y=y)=\pi_y$
  • Class conditional density:

    $$p_y(x)=N(x|\mu_y, \sigma_y^2),\quad \text{for } y\in\{0, 1\}$$
  • Bayes classifier:

    $$\begin{aligned} f(x)&=\arg\max_{y\in\{0, 1\}}p(X=x|Y=y)P(Y=y)\\ &= \begin{cases} 1 & \text{if } \frac{\pi_1}{\sigma_1}\exp\left[-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right]> \frac{\pi_0}{\sigma_0}\exp\left[-\frac{(x-\mu_0)^2}{2\sigma_0^2}\right]\\ 0 & \text{otherwise} \end{cases} \end{aligned}$$

[Figure: Gaussian Bayes classifier]
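As a quick illustration (my own sketch, with made-up parameter values), the decision rule above can be evaluated directly once $\pi_y$, $\mu_y$ and $\sigma_y$ are known:

```python
import numpy as np

# Made-up parameters for the two classes (assumptions for illustration only).
pi = {0: 0.6, 1: 0.4}       # class priors pi_y
mu = {0: -1.0, 1: 2.0}      # class means mu_y
sigma = {0: 1.0, 1: 1.5}    # class standard deviations sigma_y

def bayes_predict(x):
    """Return argmax_y pi_y * N(x | mu_y, sigma_y^2) for y in {0, 1}."""
    def score(y):
        # The 1/sqrt(2*pi) factor of the Gaussian density is shared by both
        # classes, so only pi_y / sigma_y * exp(...) matters for the argmax.
        return pi[y] / sigma[y] * np.exp(-(x - mu[y]) ** 2 / (2 * sigma[y] ** 2))
    return 1 if score(1) > score(0) else 0

print([bayes_predict(x) for x in (-2.0, 0.0, 3.0)])
```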

This type of classifier is called a generative model, because we model $x$ and $y$ jointly in one distribution, instead of plugging $x$ into a distribution over $y$.

Plug-in classifier

We can also approximate the distribution purely from the data. Classifiers built this way are called plug-in classifiers.

Still for the Gaussian model above, we can estimate the parameters of the distribution as

  • Class priors:

    $$\hat{\pi}_y=\frac{1}{n}\sum_{i=1}^n1\!\!1(y_i=y)$$
  • Class conditional density:

    $$\begin{aligned} \hat{\mu}_y&=\frac{1}{n_y}\sum_{i=1}^n1\!\!1(y_i=y)x_i\\ \hat{\Sigma}_y&=\frac{1}{n_y}\sum_{i=1}^n1\!\!1(y_i=y)(x_i-\hat{\mu}_y)(x_i-\hat{\mu}_y)^T \end{aligned}$$
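A minimal sketch of these plug-in estimates (my addition; `X` is an $n\times d$ data matrix, `y` a label vector in $\{0,1\}$, and `plug_in_estimates` / `plug_in_predict` are hypothetical names):

```python
import numpy as np
from scipy.stats import multivariate_normal

def plug_in_estimates(X, y):
    """Estimate class priors, means and covariances from data."""
    n = len(y)
    params = {}
    for c in (0, 1):
        Xc = X[y == c]                      # data points with label c
        n_c = len(Xc)
        pi_hat = n_c / n                    # \hat{pi}_y
        mu_hat = Xc.mean(axis=0)            # \hat{mu}_y
        diff = Xc - mu_hat
        Sigma_hat = diff.T @ diff / n_c     # \hat{Sigma}_y
        params[c] = (pi_hat, mu_hat, Sigma_hat)
    return params

def plug_in_predict(x, params):
    """Bayes rule with the estimates plugged in: argmax_y pi_y * N(x | mu_y, Sigma_y)."""
    scores = {c: pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
              for c, (pi, mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```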

Naive Bayes

Naïve Bayes is a Bayes classifier that makes the following assumption:

$$p(X=x|Y=y)=\prod_{j=1}^d p_j(x[j]|Y=y)$$

which assumes the dimensions of $X$ are conditionally independent given $y$.

We can then use the plug-in method to estimate each of these one-dimensional conditional distributions.
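For reference (my addition, not part of the original post), scikit-learn's `GaussianNB` implements Gaussian naive Bayes with this kind of plug-in estimate for the per-dimension means and variances:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs in 2D (made up for illustration).
X = np.vstack([rng.normal([-1.0, -1.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)   # estimates per-class, per-dimension mean and variance
print(clf.predict([[0.0, 0.0], [2.5, 2.0]]))
```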

Linear Classifier

Back to the Bayes classifier. For binary classification, we predict $y=1$ if

$$\begin{aligned} p(x|y=1)P(y=1)&>p(x|y=0)P(y=0)\\ &\Updownarrow\\ \ln{\frac{p(x|y=1)P(y=1)}{p(x|y=0)P(y=0)}}&>0 \end{aligned}$$

And if we assume

$$p(x|y)=N(x|\mu_y, \Sigma)$$

that is, the two classes share the same covariance matrix. Then

$$\begin{aligned} \ln{\frac{p(x|y=1)P(y=1)}{p(x|y=0)P(y=0)}}=&\ln\frac{\pi_1}{\pi_0}-\frac{1}{2}(\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)\\ &+x^T\Sigma^{-1}(\mu_1-\mu_0) \end{aligned}$$

This is called "linear discriminant analysis". And in this way, we turn this Bayes classifier into a linear one

$$f(x)=\mathrm{sign}(x^Tw+w_0)$$

where $w=\Sigma^{-1}(\mu_1-\mu_0)$ and $w_0=\ln\frac{\pi_1}{\pi_0}-\frac{1}{2}(\mu_1+\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)$ is the constant part of the expression above.
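Here is a small sketch (my own addition; the parameters are assumed to be already estimated, e.g. by the plug-in method, and `lda_weights` is a hypothetical helper) of how $w$ and $w_0$ are computed and used:

```python
import numpy as np

def lda_weights(pi1, pi0, mu1, mu0, Sigma):
    """Compute w and w_0 of the LDA rule f(x) = sign(x^T w + w_0)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)
    return w, w0

def lda_predict(x, w, w0):
    # Predict 1 when the estimated log-odds x^T w + w_0 is positive.
    return 1 if x @ w + w0 > 0 else 0
```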

For the assumption

$$p(x|y)=N(x|\mu_y, \Sigma_y)$$

we have

$$\begin{aligned} \ln{\frac{p(x|y=1)P(y=1)}{p(x|y=0)P(y=0)}}=&\ (\text{terms not involving } x)\\ &+x^T(\Sigma^{-1}_1\mu_1-\Sigma^{-1}_0\mu_0)\\ &+x^T(\Sigma^{-1}_0/2-\Sigma^{-1}_1/2)x \end{aligned}$$

This means that when the classes have different covariance matrices, the decision rule becomes quadratic (polynomial) in $x$.

In the LDA above, $w$ and $w_0$ are fixed by the estimated parameters, which may be too restrictive; we may do better with other values. How could we get a more general classifier?

Least Squares

The first thought is to use least squares.

[Figure: least squares linear classifier]

The problem is that this method is very sensitive to outliers and therefore performs badly.
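A sketch of this least-squares classifier (my addition, under the assumption that labels are encoded as $\pm1$): solve the least-squares problem for $w$ and classify by the sign.

```python
import numpy as np

def least_squares_classifier(X, y):
    """Fit w by ordinary least squares on labels y in {-1, +1}.

    X: n x d data matrix (include a column of ones for the bias term).
    """
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ls_predict(X, w):
    # A single far-away outlier can drag w noticeably, hence the sensitivity above.
    return np.sign(X @ w)
```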

The Perceptron Algorithm

If we only account for the wrongly classified data points, we can avoid the outlier problem of the least squares method. So in the perceptron, the loss function is

$$\mathcal{L}=-\sum_{i=1}^n(y_ix_i^Tw)1\!\!1\{y_i\neq \mathrm{sign}(x_i^Tw)\}$$

And we will use gradient descent to solve this problem.

But the problem with the perceptron is that it only converges when the data is linearly separable.
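A sketch of the resulting updates (my own, for labels in $\{-1,+1\}$): the gradient of the loss at a misclassified point is $-y_ix_i$, so each gradient step adds $\eta\,y_ix_i$ to $w$.

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Train w by gradient steps on the perceptron loss (y in {-1, +1}).

    X: n x d data matrix (include a column of ones for the bias).
    Only terminates with zero mistakes when the data is linearly separable.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.sign(xi @ w) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi           # step along -grad = y_i x_i
                mistakes += 1
        if mistakes == 0:                   # perfect separation reached
            break
    return w
```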

Logistic Regression

We could directly plug in the linear representation for the log odds:

$$\ln{\frac{p(y=+1|x)}{p(y=-1|x)}}=x^Tw$$

And we would have

$$p(y=+1|x)=\sigma(x^Tw)=\frac{e^{x^Tw}}{1+e^{x^Tw}}$$

Notice that this is a discriminative classifier, because only the conditional distribution $p(y|x)$ is modeled.

The objective would be the likelihood, which is

$$\begin{aligned} L&=\prod_{i=1}^np(y_i|x_i, w)\\ &=\prod_{i=1}^n\sigma_i(w)^{1\!\!1(y_i=+1)}(1-\sigma_i(w))^{1\!\!1(y_i=-1)} \end{aligned}$$

where $\sigma_i(w)=\sigma(x_i^Tw)$.

Again, we can use gradient methods to maximize the likelihood (equivalently, minimize the negative log-likelihood) and obtain a good $w$. Notice that when updating $w$, each data point is weighted by how wrong the current prediction on it is, which makes logistic regression much less vulnerable to outliers.
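A sketch of this gradient-based training (my addition; labels in $\{-1,+1\}$ and $\sigma_i(y_iw)=\sigma(y_ix_i^Tw)$ as in the text), doing gradient ascent on the log-likelihood:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(X, y, lr=0.1, iters=1000):
    """Maximize sum_i ln sigma(y_i x_i^T w) by gradient ascent (y in {-1, +1})."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        s = sigmoid(y * (X @ w))          # sigma_i(y_i w) for every data point
        # Each point contributes (1 - sigma_i) * y_i * x_i: the more wrong the
        # current prediction, the harder that point pulls on w.
        w += lr * (X.T @ ((1.0 - s) * y))
    return w
```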

The problems with logistic regression are:

  • If the data is linearly separable, then $||w_{ML}||_2\rightarrow\infty$. In other words, gradient descent will not converge. This is because letting $w$ go to infinity drives $\sigma_i(y_iw)\rightarrow1$ for all $(x_i, y_i)$ and keeps increasing the likelihood.
  • For nearly separable data, it may get a few points very wrong in order to be more confident about the rest, which is overfitting.

The solution is to add a regularization term to the objective. From the probabilistic point of view, this is maximizing the posterior distribution (MAP) while assuming a normal prior on $w$.

$$\begin{aligned} w_{MAP}&=\arg\max_w\ln p(w|y, x)\\ &=\arg\max_w\ln p(y|w,x)+\ln p(w)\\ &=\arg\max_w\sum_{i=1}^n\ln\sigma_i(y_iw)-\lambda w^Tw \end{aligned}$$

And as in the regularized linear regression case, we use $\lambda w^Tw$ as the penalty term.

Bayesian logistic regression and Laplace approximation

With the prior we just used ($w\sim N(0, \lambda^{-1}I)$), we have

$$p(w|x, y)=\frac{p(w)\prod_{i=1}^n\sigma_i(y_iw)}{\int p(w)\prod_{i=1}^n\sigma_i(y_iw)dw}$$

which we cannot calculate directly.

Our strategy is to approximate $p(w|x, y)$ with a normal distribution, so we need a method for setting its $\mu$ and $\Sigma$.

We would use the Laplace approximation to estimate it.

$$\begin{aligned} p(w|x,y)&=\frac{p(y,w|x)}{p(y|x)}=\frac{p(y,w|x)}{\int p(y,w|x)dw}\\ &=\frac{e^{\ln p(y,w|x)}}{\int e^{\ln p(y,w|x)}dw} \end{aligned}$$

Let's define $f(w)=\ln p(y,w|x)$. Using a second-order Taylor expansion around a point $z$, we have

$$f(w)\approx f(z)+(w-z)^T\nabla f(z)+\frac{1}{2}(w-z)^T(\nabla^2f(z))(w-z)$$

where we expand around $z = w_{MAP}$. Since $w_{MAP}$ maximizes $f$, we have $\nabla f(z)=0$, so

$$p(w|x,y)=\frac {e^{-\frac{1}{2}(w-w_{MAP})^T(-\nabla^2f(w_{MAP}))(w-w_{MAP})}} {\int e^{-\frac{1}{2}(w-w_{MAP})^T(-\nabla^2f(w_{MAP}))(w-w_{MAP})}dw}$$

which makes

$$\begin{aligned} \mu&=w_{MAP}\\ \Sigma&=(-\nabla^2\ln p(y, w_{MAP}|x))^{-1} \end{aligned}$$

So the posterior distribution is approximately Gaussian: $p(w|x,y)\approx N(w_{MAP}, \Sigma)$.
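A sketch of the full procedure (my own, reusing the $\lambda w^Tw$ penalty from above): find $w_{MAP}$ by gradient ascent on $\ln p(y,w|x)$, then invert the negative Hessian at $w_{MAP}$ to get $\Sigma$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def laplace_approximation(X, y, lam=1.0, lr=0.1, iters=2000):
    """Approximate p(w | x, y) by N(w_MAP, Sigma) for regularized logistic regression.

    X: n x d data matrix, y in {-1, +1}, penalty lam * w^T w as in the text.
    """
    d = X.shape[1]
    w = np.zeros(d)
    # Gradient ascent on ln p(y, w | x) = sum_i ln sigma(y_i x_i^T w) - lam * w^T w + const.
    for _ in range(iters):
        s = sigmoid(y * (X @ w))
        w += lr * (X.T @ ((1.0 - s) * y) - 2.0 * lam * w)
    w_map = w
    # Negative Hessian of ln p(y, w | x) at w_MAP:
    #   sum_i sigma_i (1 - sigma_i) x_i x_i^T + 2 * lam * I
    s = sigmoid(y * (X @ w_map))
    neg_hessian = X.T @ (X * (s * (1.0 - s))[:, None]) + 2.0 * lam * np.eye(d)
    Sigma = np.linalg.inv(neg_hessian)
    return w_map, Sigma
```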