This is a summary of many rudimentary binary classifiers. More complex ones like SVM will be covered in a future post.
Bayes Classifier
For any classifier f:X→Y, its prediction error is:
$$P(f(X) \neq Y) = E[\mathbb{1}(f(X) \neq Y)] = E\big[E[\mathbb{1}(f(X) \neq Y) \mid X]\big]$$
And for each x∈X,
$$E[\mathbb{1}(f(X) \neq Y) \mid X = x] = \sum_{y \in \mathcal{Y}} P(Y = y \mid X = x)\,\mathbb{1}(f(x) \neq y)$$
Since the term with y = f(x) contributes nothing to this sum, the error equals 1 − P(Y = f(x) | X = x). So for this fixed x, to minimize the error we just need to maximize the one term that is left out of the sum, which is
$$P(Y = y \mid X = x)\,\mathbb{1}(f(x) = y)$$
Therefore, the optimal classifier would be:
$$f(x) = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x)$$
A classifier with the above property for all x∈X is called the Bayes classifier. In other words, the Bayes classifier always predicts the y with the highest conditional probability, and it has the smallest prediction error among all classifiers.
And from the Bayes rule, we have
$$f(x) = \arg\max_{y \in \mathcal{Y}} \frac{P(Y = y, X = x)}{P(X = x)} = \arg\max_{y \in \mathcal{Y}} P(Y = y) \times P(X = x \mid Y = y)$$
Notice that since x is fixed, P(X=x) is a constant. We can interpret the new formula as the product of the class prior and the class-conditional likelihood of the data. In practice we know neither of them, so we approximate them.
Aside: If X is a continuous variable, replace P with p.
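As a concrete illustration (not from the original post), here is a minimal sketch of the Bayes classifier for a toy discrete problem in which the class prior and the class-conditional likelihoods are assumed to be known exactly:

```python
import numpy as np

# Toy problem: X takes values {0, 1, 2}, Y takes values {0, 1}.
# The prior P(Y = y) and likelihood P(X = x | Y = y) are assumed known here.
prior = np.array([0.6, 0.4])                 # prior[y] = P(Y = y)
likelihood = np.array([[0.5, 0.3, 0.2],      # likelihood[y, x] = P(X = x | Y = y)
                       [0.1, 0.3, 0.6]])

def bayes_classifier(x):
    """Predict the y that maximizes P(Y = y) * P(X = x | Y = y)."""
    return int(np.argmax(prior * likelihood[:, x]))

print([bayes_classifier(x) for x in range(3)])   # -> [0, 0, 1]
```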
For example, we could make the following assumption (a Gaussian model for each class):
$$P(Y = y) = \pi_y, \qquad p(X = x \mid Y = y) = N(x \mid \mu_y, \Sigma_y)$$
This type of classifier is called a generative model, because we model x and y together with a joint distribution, instead of plugging x into a distribution over y.
Plug-in classifier
We can also approximate the distributions purely from the data. Classifiers built this way are called plug-in classifiers.
Still for the above Gaussian model, we could estimate the parameters of the distribution as
$$\hat{\pi}_y = \frac{n_y}{n}, \qquad \hat{\mu}_y = \frac{1}{n_y}\sum_{i:\,y_i = y} x_i, \qquad \hat{\Sigma}_y = \frac{1}{n_y}\sum_{i:\,y_i = y} (x_i - \hat{\mu}_y)(x_i - \hat{\mu}_y)^T$$
where $n_y$ is the number of training points with label $y$.
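A minimal sketch of this plug-in approach under the Gaussian assumption above; the function and variable names are my own, not the post's:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_plugin(X, y):
    """Estimate the class prior, mean and covariance for each class from the data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),                       # pi_hat
                     Xc.mean(axis=0),                        # mu_hat
                     np.cov(Xc, rowvar=False, bias=True))    # Sigma_hat (MLE)
    return params

def plugin_predict(params, x):
    """Plug the estimates into the Bayes rule: argmax_y pi_y * N(x | mu_y, Sigma_y)."""
    return max(params, key=lambda c: params[c][0] *
               multivariate_normal.pdf(x, mean=params[c][1], cov=params[c][2]))
```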
Naive Bayes
Naïve Bayes is a Bayes classifier that has the following assumption
$$p(X = x \mid Y = y) = \prod_{j=1}^{d} p_j(x[j] \mid Y = y)$$
that is, the dimensions of X are assumed to be conditionally independent given y.
And then we could use the plug-in method to estimate the distribution.
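For instance, here is a Gaussian naive Bayes sketch in which every dimension gets its own one-dimensional plug-in estimate; it assumes continuous features and uses my own helper names:

```python
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    """Per class: the prior plus a 1-D Gaussian (mean, std) for every dimension."""
    return {c: (np.mean(y == c), X[y == c].mean(axis=0), X[y == c].std(axis=0))
            for c in np.unique(y)}

def naive_bayes_predict(params, x):
    """argmax_y log pi_y + sum_j log p_j(x[j] | y), using the independence assumption."""
    return max(params, key=lambda c: np.log(params[c][0]) +
               norm.logpdf(x, loc=params[c][1], scale=params[c][2]).sum())
```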
Linear Classifier
Back to the Bayes classifier. For binary classification, we predict y = 1 if
$$p(x \mid y = 1)P(y = 1) > p(x \mid y = 0)P(y = 0) \;\Longleftrightarrow\; \ln\frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} > 0$$
And if we assume
$$p(x \mid y) = N(x \mid \mu_y, \Sigma)$$
that is, the two classes share the same covariance matrix, then
$$\ln\frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} = \ln\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0) + x^T\Sigma^{-1}(\mu_1 - \mu_0)$$
This is called "linear discriminant analysis" (LDA). In this way, we turn the Bayes classifier into a linear one,
$$f(x) = \text{sign}(x^T w + w_0)$$
with $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $w_0 = \ln\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^T\Sigma^{-1}(\mu_1 - \mu_0)$.
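A small sketch (my own notation) that turns estimated Gaussians with a shared covariance into the linear rule above:

```python
import numpy as np

def lda_weights(mu0, mu1, Sigma, pi0, pi1):
    """Return (w, w0) such that sign(x @ w + w0) reproduces the LDA rule above."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    w0 = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sigma_inv @ (mu1 - mu0)
    return w, w0

def lda_predict(x, w, w0):
    """Predict class 1 when the log odds are positive, class 0 otherwise."""
    return 1 if x @ w + w0 > 0 else 0
```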
If we instead assume
$$p(x \mid y) = N(x \mid \mu_y, \Sigma_y)$$
we have
$$\ln\frac{p(x \mid y = 1)P(y = 1)}{p(x \mid y = 0)P(y = 0)} = (\text{terms not involving } x) + x^T(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0) - \frac{1}{2}x^T(\Sigma_1^{-1} - \Sigma_0^{-1})x$$
This means that with a different covariance matrix per class, the decision boundary becomes quadratic in x; equivalently, we get a linear classifier in degree-two polynomial features of x.
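And a corresponding sketch of the quadratic score under the class-specific covariance assumption; the helper names are mine:

```python
import numpy as np

def qda_log_odds(x, mu0, mu1, Sigma0, Sigma1, pi0, pi1):
    """ln[ p(x|y=1)P(y=1) / (p(x|y=0)P(y=0)) ] for Gaussians with different covariances."""
    def log_gauss(x, mu, Sigma):
        d = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (logdet + d @ np.linalg.solve(Sigma, d) + len(x) * np.log(2 * np.pi))
    return np.log(pi1 / pi0) + log_gauss(x, mu1, Sigma1) - log_gauss(x, mu0, Sigma0)

# Predict y = 1 whenever qda_log_odds(...) > 0; the boundary is quadratic in x.
```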
In LDA above, $w$ and $w_0$ are fixed by the Gaussian estimates, which may be too restrictive; we may do better with other values. How could we get a more general classifier?
Least Squares
The first thought is to use least squares: treat the labels $y_i \in \{-1, +1\}$ as real-valued regression targets and fit $w$ by minimizing $\sum_i (y_i - x_i^T w)^2$.
The problem is that this method is very sensitive to outliers and therefore performs badly.
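A sketch of the least squares classifier just described, fitting $w$ on $\pm 1$ targets and thresholding at zero; the bias handling and names are my own choices:

```python
import numpy as np

def fit_least_squares(X, y):
    """Treat labels y in {-1, +1} as regression targets and solve min_w ||y - Xb w||^2."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def least_squares_predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)
```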
The Perceptron Algorithm
If we only take the wrongly classified data points into account, we can avoid the outlier problem of the least squares method. So in the perceptron, the loss function is
$$L = -\sum_{i=1}^{n} (y_i x_i^T w)\,\mathbb{1}\{y_i \neq \text{sign}(x_i^T w)\}$$
And we will use gradient descent to solve this problem.
But the problem with the perceptron is that it can only deal with data that are linearly separable.
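A minimal perceptron sketch following the loss above, making stochastic updates only on misclassified points; the step size and epoch count are arbitrary choices of mine:

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """y in {-1, +1}. Update w only on misclassified points (gradient of -(y_i x_i^T w))."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:     # wrongly classified (or on the boundary)
                w += lr * yi * xi      # one stochastic gradient step on the perceptron loss
                mistakes += 1
        if mistakes == 0:              # converges only if the data are linearly separable
            break
    return w
```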
Logistic Regression
We could directly plug in the linear representation for the log odds:
$$\ln\frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = x^T w$$
And we would have
$$p(y = +1 \mid x) = \sigma(x^T w) = \frac{e^{x^T w}}{1 + e^{x^T w}}$$
Notice that this is a discriminative classifier, because only the conditional distribution of y given x is modeled.
The objective function is the likelihood, which is
$$L = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \sigma(x_i^T w)^{\mathbb{1}(y_i = +1)}\,(1 - \sigma(x_i^T w))^{\mathbb{1}(y_i = -1)}$$
We can use gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood) to find a good $w$. Notice that when updating $w$, each data point is weighted by how wrongly it is currently classified, namely $1 - \sigma(y_i x_i^T w)$, which makes logistic regression far less sensitive to outliers than least squares.
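A sketch of the maximum-likelihood fit by gradient ascent, which also makes the per-point weight $1 - \sigma(y_i x_i^T w)$ explicit; the learning rate and iteration count are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, steps=1000, lr=0.1):
    """y in {-1, +1}. Gradient ascent on sum_i log sigma(y_i x_i^T w)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        weights = 1.0 - sigmoid(y * (X @ w))   # how wrong each point currently is
        w += lr * X.T @ (weights * y)          # gradient of the log-likelihood
    return w
```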
The problems with logistic regression are:
- If the data are linearly separable, then $\|w_{\text{ML}}\|_2 \to \infty$; in other words, gradient ascent will not converge. This is because letting $\|w\|$ go to infinity drives $\sigma(y_i x_i^T w) \to 1$ for all $(x_i, y_i)$, which keeps increasing the likelihood.
- For nearly separable data, it may classify a few points very wrongly in order to be more confident about the rest, which is overfitting.
The solution is to add a regularization term to the objective. From the probabilistic point of view, this amounts to maximizing the posterior distribution (MAP) while assuming a normal prior on $w$.
$$w_{\text{MAP}} = \arg\max_{w} \ln p(w \mid y, x) = \arg\max_{w} \ln p(y \mid w, x) + \ln p(w) = \arg\max_{w} \sum_{i=1}^{n} \ln \sigma(y_i x_i^T w) - \lambda w^T w$$
And as in the (ridge) linear regression case, we use $\lambda w^T w$ as the penalty term.
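The MAP version only changes the gradient by the derivative of the penalty term; a sketch under the same conventions as above:

```python
import numpy as np

def fit_logistic_map(X, y, lam=1.0, steps=1000, lr=0.1):
    """y in {-1, +1}. Gradient ascent on sum_i log sigma(y_i x_i^T w) - lam * w^T w."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        weights = 1.0 - sigmoid(y * (X @ w))
        w += lr * (X.T @ (weights * y) - 2.0 * lam * w)   # extra -2*lam*w from the penalty
    return w
```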
Bayesian logistic regression and Laplace approximation
From the prior we just used ($w \sim N(0, \lambda^{-1} I)$), we have
$$p(w \mid x, y) = \frac{p(w)\prod_{i=1}^{n}\sigma(y_i x_i^T w)}{\int p(w)\prod_{i=1}^{n}\sigma(y_i x_i^T w)\,dw}$$
which we cannot calculate directly.
Our strategy is to approximate $p(w \mid x, y)$ with a normal distribution $N(\mu, \Sigma)$, so we need a method for setting $\mu$ and $\Sigma$.
We would use the Laplace approximation to estimate it.
$$p(w \mid x, y) = \frac{p(y, w \mid x)}{p(y \mid x)} = \frac{p(y, w \mid x)}{\int p(y, w \mid x)\,dw} = \frac{e^{\ln p(y, w \mid x)}}{\int e^{\ln p(y, w \mid x)}\,dw}$$
Let's define $f(w) = \ln p(y, w \mid x)$. Using a second-order Taylor expansion around a point $z$, we have
$$f(w) \approx f(z) + (w - z)^T \nabla f(z) + \frac{1}{2}(w - z)^T(\nabla^2 f(z))(w - z)$$
where we choose $z = w_{\text{MAP}}$. Since $w_{\text{MAP}}$ maximizes $f$, we have $\nabla f(z) = 0$, so
$$p(w \mid x, y) \approx \frac{e^{-\frac{1}{2}(w - w_{\text{MAP}})^T(-\nabla^2 f(w_{\text{MAP}}))(w - w_{\text{MAP}})}}{\int e^{-\frac{1}{2}(w - w_{\text{MAP}})^T(-\nabla^2 f(w_{\text{MAP}}))(w - w_{\text{MAP}})}\,dw}$$
which makes
$$\mu = w_{\text{MAP}}, \qquad \Sigma = \left(-\nabla^2 \ln p(y, w_{\text{MAP}} \mid x)\right)^{-1}$$
And what we end up with is that the posterior distribution is approximately the Gaussian $N(w_{\text{MAP}}, \Sigma)$.
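Putting the pieces together, here is a sketch of the Laplace approximation for this model: the mean is $w_{\text{MAP}}$ (found, e.g., with the MAP fit above) and the covariance is the inverse of the negative Hessian of $\ln p(y, w \mid x)$ at that point; all names here are mine:

```python
import numpy as np

def laplace_posterior(X, y, w_map, lam=1.0):
    """Return (mu, Sigma) of the Gaussian approximation to p(w | x, y).

    With f(w) = sum_i log sigma(y_i x_i^T w) - lam * w^T w + const, the Hessian is
    -sum_i s_i (1 - s_i) x_i x_i^T - 2 * lam * I, where s_i = sigma(y_i x_i^T w).
    """
    s = 1.0 / (1.0 + np.exp(-y * (X @ w_map)))
    hessian = -(X.T * (s * (1.0 - s))) @ X - 2.0 * lam * np.eye(X.shape[1])
    Sigma = np.linalg.inv(-hessian)    # Sigma = (-Hessian at w_MAP)^{-1}
    return w_map, Sigma
```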