This semester, all three of my courses touched on linear regression, so I thought it would be nice to summarize it here.
In linear regression, we use a linear model to fit $y$:
$$y_i \approx f(x_i; w) = w_0 + \sum_{j=1}^{d} x_{ij} w_j$$
where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$.
Least Squares LR
Model definition
Least squares linear regression, as the name suggests, has a least squares objective:
$$w_{LS} = \underset{w}{\arg\min} \sum_{i=1}^{n} (y_i - f(x_i; w))^2 \equiv \underset{w}{\arg\min} \; \mathcal{L}$$
In other words, we assume there is a linear relationship between $x_i$ and $y_i$ with noise $\epsilon_i$:
$$y_i = w_0 + \sum_{j=1}^{d} x_{ij} w_j + \epsilon_i$$
And we want to minimize
$$\mathcal{L} = \sum_{i=1}^{n} \epsilon_i^2$$
We can use matrix notation to simplify things (for convenience, we prepend a 1 as the first entry of each $x_i$):
$$x_i = \begin{bmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{bmatrix}_{(d+1) \times 1}$$
And the data can be represented as one matrix:
$$X = \begin{bmatrix} 1 & x_{11} & \dots & x_{1d} \\ 1 & x_{21} & \dots & x_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{nd} \end{bmatrix}_{n \times (d+1)} = \begin{bmatrix} -\ x_1^T\ - \\ -\ x_2^T\ - \\ \vdots \\ -\ x_n^T\ - \end{bmatrix}_{n \times (d+1)}$$
And similarly for $w$:
$$w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}_{(d+1) \times 1}$$
For now, let's assume that $d < n$.
And the vector version of LR becomes:
$$\mathcal{L} = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = \|y - Xw\|^2 = (y - Xw)^T (y - Xw)$$
And the derivative would be:
$$\nabla_w \mathcal{L} = 2X^T X w - 2X^T y = 0 \;\Rightarrow\; w_{LS} = (X^T X)^{-1} X^T y$$
How to calculate matrix derivatives is covered here.
And the prediction is
$$y_{\text{new}} \approx x_{\text{new}}^T w_{LS}$$
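As a sanity check, here is a minimal numpy sketch of the closed-form solution and prediction above (the toy data and variable names are my own):

```python
import numpy as np

# Toy data: n = 50 samples, d = 3 features, known weights plus Gaussian noise.
rng = np.random.default_rng(0)
n, d = 50, 3
w_true = np.array([1.0, -2.0, 0.5, 3.0])                    # [w0, w1, w2, w3]
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend the column of 1s
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: w_LS = (X^T X)^{-1} X^T y.
# np.linalg.solve is numerically preferable to forming the inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new point (with the leading 1 added).
x_new = np.concatenate([[1.0], rng.normal(size=d)])
y_new = x_new @ w_ls
```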
Solution Existence
When $X^T X$ is full rank, which means $X$ has at least $d+1$ linearly independent rows, the solution exists.
For polynomial regression, we can simply add the polynomial terms as extra columns of $X$. More generally, we can use other basis functions (as in kernel methods):
$$y_i \approx f(x_i, w) = \sum_{s=1}^{S} g_s(x_i) w_s$$
As long as the model is linear in $w$, the solution keeps the form $(X^T X)^{-1} X^T y$, with $X$ built from the $g_s(x_i)$.
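For instance, here is a minimal sketch of cubic polynomial regression in one variable, where the helper `poly_design` (my own name) just stacks powers of $x$ as columns; the fit still uses the same closed-form formula:

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree] for 1-D inputs x."""
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = 2.0 - x + 0.5 * x**3 + 0.05 * rng.normal(size=100)

X = poly_design(x, degree=3)              # nonlinear in x, but still linear in w
w = np.linalg.solve(X.T @ X, X.T @ y)     # same (X^T X)^{-1} X^T y formula
```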
Geometric interpretation
The columns of $X$ define a $(d+1)$-dimensional subspace of $\mathbb{R}^n$. The least squares prediction $Xw_{LS}$ is the closest point to $y$ in that subspace, i.e. the orthogonal projection of $y$ onto the column space of $X$.
Probabilistic interpretation
Assume
$$p(y \mid \mu, \sigma^2 I) = N(\mu = Xw, \sigma^2 I)$$
Then the maximum likelihood (ML) estimator would be
$$w_{ML} = \underset{w}{\arg\max} \; \ln p(y \mid Xw, \sigma^2 I) = \underset{w}{\arg\min} \; \|y - Xw\|^2$$
Therefore, using least squares amounts to assuming independent Gaussian noise on the errors.
And also from this probabilistic assumption
$$E[w_{ML}] = (X^T X)^{-1} X^T E[y] = (X^T X)^{-1} X^T X w = w$$
so $w_{ML}$ is unbiased. Similarly, we can calculate the variance:
$$\mathrm{Var}[w_{ML}] = \sigma^2 (X^T X)^{-1}$$
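These two facts are easy to check empirically; the following small simulation (my own sketch) refits $w_{ML}$ on many independent noise draws and compares the empirical mean and covariance of the estimates with $w$ and $\sigma^2(X^TX)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 2, 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -1.0])

# Refit w_ML on many independent draws of the noise.
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ w_true + sigma * rng.normal(size=n)))
    for _ in range(5000)
])

print(estimates.mean(axis=0))                 # close to w_true (unbiased)
print(np.cov(estimates.T))                    # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```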
When $\sigma^2 (X^T X)^{-1}$ is large, $w_{ML}$ can take on huge values. To prevent this, we introduce regularization.
Ridge Regression
For general regularized linear regression, the model is
$$w_{\text{reg}} = \underset{w}{\arg\min} \; \|y - Xw\|^2 + \lambda g(w)$$
where $\lambda \ge 0$ and $g(w) \ge 0$.
Ridge regression is one member of this family:
$$w_{RR} = \underset{w}{\arg\min} \; \|y - Xw\|^2 + \lambda \|w\|^2$$
and it has a closed-form solution:
$$\mathcal{L} = \|y - Xw\|^2 + \lambda \|w\|^2 = (y - Xw)^T (y - Xw) + \lambda w^T w$$
$$w_{RR} = (\lambda I + X^T X)^{-1} X^T y$$
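In code, the only change relative to the least squares sketch above is the $\lambda I$ term (again a minimal numpy sketch; `ridge_fit` is my own name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (lambda * I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# lam = 0 recovers the least squares solution (when X^T X is invertible).
```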
There is a tradeoff between the squared error and the penalty on the size of $w$.
$$E[w_{RR}] = Z w$$
and
$$\mathrm{Var}[w_{RR}] = \sigma^2 Z (X^T X)^{-1} Z^T$$
where $Z = (I + \lambda (X^T X)^{-1})^{-1}$.
RR gives us a solution that is biased, but has lower variance than LS.
Data preprocessing
For ridge regression, we assume the following preprocessing:
- The mean is subtracted off of y
$$y \leftarrow y - \frac{1}{n} \sum_{i=1}^{n} y_i$$
- The dimensions of $x_i$ are standardized before constructing $X$:
$$x_{ij} \leftarrow (x_{ij} - \bar{x}_{\cdot j}) / \hat{\sigma}_j$$
- We can show that there is then no need for the column of 1's (see the short sketch below).
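A minimal sketch of this preprocessing (the function name is mine):

```python
import numpy as np

def preprocess(X_raw, y):
    """Center y and standardize each column of X; no column of 1s is needed."""
    y_centered = y - y.mean()
    X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
    return X_std, y_centered
```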
Probabilistic interpretation
If we set the prior distribution on $w$ to $w \sim N(0, \lambda^{-1} I)$, then
$$p(w) = \left(\frac{\lambda}{2\pi}\right)^{d/2} e^{-\frac{\lambda}{2} w^T w}$$
The maximum a posteriori (MAP) estimate is:
$$\begin{aligned} w_{MAP} &= \underset{w}{\arg\max} \; \ln p(w \mid y, X) = \underset{w}{\arg\max} \; \ln \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} \\ &= \underset{w}{\arg\max} \; \ln p(y \mid w, X) + \ln p(w) - \ln p(y \mid X) \\ &= \underset{w}{\arg\max} \; -\frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) - \frac{\lambda}{2} w^T w + \text{const.} \\ &= (\lambda \sigma^2 I + X^T X)^{-1} X^T y \end{aligned}$$
which is $w_{RR}$ (with $\lambda \sigma^2$ playing the role of $\lambda$).
Therefore, RR maximizes the posterior.
SVD
For an $n \times d$ matrix $X$ (assume $n > d$), we can compute its singular value decomposition:
$$X = U S V^T$$
where
- $U$: $n \times d$ and orthonormal in the columns, i.e. $U^T U = I$
- $S$: $d \times d$ non-negative diagonal matrix, i.e. $S_{ii} \ge 0$ and $S_{ij} = 0$ for $i \neq j$
- $V$: $d \times d$ and orthonormal, i.e. $V^T V = V V^T = I$
We have
$$w_{LS} = (X^T X)^{-1} X^T y = V S^{-2} V^T V S U^T y = V S^{-1} U^T y$$
And
$$w_{RR} = V S_\lambda^{-1} U^T y$$
where
$$S_\lambda^{-1} = \begin{bmatrix} \frac{S_{11}}{\lambda + S_{11}^2} & & 0 \\ & \ddots & \\ 0 & & \frac{S_{dd}}{\lambda + S_{dd}^2} \end{bmatrix}$$
And define the degrees of freedom as:
$$df(\lambda) = \mathrm{tr}\left[X (X^T X + \lambda I)^{-1} X^T\right] = \sum_{i=1}^{d} \frac{S_{ii}^2}{\lambda + S_{ii}^2}$$
If $\lambda = 0$, $df = d$, which is the dimension of the data; if $\lambda \to \infty$, $df \to 0$. Therefore we can observe how each entry of $w$ changes as $\lambda$ varies; ESL has a nice example figure of this.
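Here is a small sketch (my own) that computes both $w_{RR}$ and $df(\lambda)$ directly from the SVD; setting `lam = 0` recovers the least squares solution:

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge solution and degrees of freedom computed from the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U S V^T
    w_rr = Vt.T @ ((s / (lam + s**2)) * (U.T @ y))      # V S_lambda^{-1} U^T y
    df = np.sum(s**2 / (lam + s**2))                    # tr[X (X^T X + lam I)^{-1} X^T]
    return w_rr, df
```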
Bias-Variance for linear regression
For a new pair $(x_0, y_0)$ with $y_0 = x_0^T w + \epsilon_0$ and an estimator $\hat{w}$ fit on the training data, the expected prediction error decomposes as
$$\begin{aligned} E[(y_0 - x_0^T \hat{w})^2 \mid X, x_0] &= E[y_0^2] - 2 x_0^T E[y_0 \hat{w}] + x_0^T E[\hat{w} \hat{w}^T] x_0 \\ &= \left(\sigma^2 + (x_0^T w)^2\right) - 2 (x_0^T w)(x_0^T E[\hat{w}]) + x_0^T \left(\mathrm{Var}[\hat{w}] + E[\hat{w}] E[\hat{w}]^T\right) x_0 \\ &= \sigma^2 + x_0^T (w - E[\hat{w}])(w - E[\hat{w}])^T x_0 + x_0^T \mathrm{Var}[\hat{w}] x_0 \\ &= \text{noise} + \text{squared bias} + \text{variance} \end{aligned}$$
Active Learning
If we can only make a limited number of extra measurements, say 5 more, how can we best use them?
We should measure the point we are least sure about, i.e. the one with the maximal predictive variance.
1. Form the predictive distribution $p(y_0 \mid x_0, y, X)$ for all unmeasured $x_0 \in D$
2. Pick the $x_0$ for which $\sigma_0^2$ is largest and measure $y_0$
3. Update the posterior $p(w \mid y, X)$ where $y \leftarrow (y; y_0)$ and $X \leftarrow (X; x_0)$
4. Return to step 1 using the updated posterior
For linear regression, this means picking the $x_0$ that maximizes $x_0^T \Sigma x_0$.
In fact, this process is minimizing the entropy, which is
$$H(p) = -\int p(z) \ln p(z)\, dz$$
And the entropy for multivariate Gaussian is:
$$H(N(w \mid \mu, \Sigma)) = \frac{1}{2} \ln\left((2\pi e)^d |\Sigma|\right)$$
Adding $x_0$ changes the posterior covariance as follows:
$$(\lambda I + \sigma^{-2} X^T X)^{-1} \equiv \Sigma \;\Rightarrow\; (\lambda I + \sigma^{-2}(x_0 x_0^T + X^T X))^{-1} = (\Sigma^{-1} + \sigma^{-2} x_0 x_0^T)^{-1}$$
And the change in entropy is:
$$H_{\text{post}} = H_{\text{prior}} - \frac{1}{2} \ln\left(1 + \sigma^{-2} x_0^T \Sigma x_0\right)$$
Hence, this method is a greedy algorithm to minimize entropy.
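Here is a minimal sketch of the greedy selection step and the posterior update in this setting (function and variable names are mine; `Sigma` is the current posterior covariance and `sigma2` the noise variance):

```python
import numpy as np

def pick_next_point(X_unmeasured, Sigma):
    """Index of the unmeasured x0 with the largest x0^T Sigma x0."""
    variances = np.einsum("ij,jk,ik->i", X_unmeasured, Sigma, X_unmeasured)
    return int(np.argmax(variances))

def update_posterior(Sigma, x0, sigma2):
    """Rank-one update of the posterior covariance after measuring y0 at x0."""
    Sigma_inv = np.linalg.inv(Sigma) + np.outer(x0, x0) / sigma2
    return np.linalg.inv(Sigma_inv)
```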
Undetermined Linear Equations
If there are more dimensions than observations, i.e. $d \gg n$, then $w$ has an infinite number of solutions satisfying $y = Xw$.
$$\begin{bmatrix} y \end{bmatrix}_{n \times 1} = \begin{bmatrix} \quad X \quad \end{bmatrix}_{n \times d} \begin{bmatrix} w \end{bmatrix}_{d \times 1}$$
Least Norm
One possible solution is
$$w_{ln} = X^T (X X^T)^{-1} y \;\Rightarrow\; X w_{ln} = y$$
We can show that $w_{ln}$ is the solution with the smallest $\ell_2$ norm.
Proof: If there is another solution $w$, we have $X(w - w_{ln}) = 0$. Also,
$$(w - w_{ln})^T w_{ln} = (w - w_{ln})^T X^T (X X^T)^{-1} y = (X(w - w_{ln}))^T (X X^T)^{-1} y = 0$$
And this orthogonal relation gives us
$$\|w\|^2 = \|w - w_{ln} + w_{ln}\|^2 = \|w - w_{ln}\|^2 + \|w_{ln}\|^2 > \|w_{ln}\|^2$$
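A quick numpy sketch (my own) of the least-norm solution in the underdetermined case; `np.linalg.pinv(X) @ y` would give the same answer:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 50                                   # more dimensions than observations
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Least-norm solution: w_ln = X^T (X X^T)^{-1} y.
w_ln = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ w_ln, y)                 # it interpolates the data exactly
```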
Lagrange Multiplier
Introduce Lagrange multipliers:
$$\mathcal{L}(w, \eta) = w^T w + \eta^T (Xw - y)$$
We maximize over $\eta$ and minimize over $w$ (much like deriving the dual of a linear program!), which forces $Xw = y$.
The optimality conditions give us
$$\nabla_w \mathcal{L} = 2w + X^T \eta = 0, \qquad \nabla_\eta \mathcal{L} = Xw - y = 0$$
1. From the first condition: $w = -X^T \eta / 2$
2. Plug this into the second condition to find: $\eta = -2(X X^T)^{-1} y$
3. Plug $\eta$ back into the first expression to find: $w_{ln} = X^T (X X^T)^{-1} y$
LASSO
LS and RR are not suited for high-dimensional data, because:
- They weight all dimensions without favoring subsets of dimensions.
- The unknown “important” dimensions are mixed in with irrelevant ones.
- They generalize poorly to new data, and the weights may not be interpretable.
- The LS solution is not unique when $d > n$, meaning predictions are not unique either.
One intuition for why RR is not good for high-dimensional data is that the quadratic penalty shrinks the larger entries the most, leaving all entries of $w$ small but not exactly 0.
To find a sparse solution (one with many 0s), we need a penalty that is linear in the magnitudes, which is why we introduce LASSO.
Model definition
$$w_{\text{lasso}} = \underset{w}{\arg\min} \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$
where $\|w\|_1 = \sum_{j=1}^{d} |w_j|$.
With this penalty, the cost reduction from shrinking $w_j$ does not depend on the magnitude of $w_j$, which helps produce a sparse solution.
Looking at the coefficient path as $\lambda$ increases (i.e. as $df(\lambda)$ decreases), the weights go to 0 and stay at 0 (in some special cases a weight may leave 0 again, but this is rare).
More generally, there is $\ell_p$ regression, where $p = 1$ is LASSO and $p = 2$ is RR. Notice that when $p < 1$, the penalty is no longer convex and we cannot guarantee a global optimum.
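There is no closed-form solution for LASSO. As an illustration (not necessarily how any particular course solves it), here is a minimal sketch of one standard solver, proximal gradient descent (ISTA); in practice coordinate descent or `sklearn.linear_model.Lasso` are common choices:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)               # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```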
Greedy Sparse Regression
We can also use LS itself to help pick a sparse solution. The algorithm used here is called orthogonal matching pursuit: as the name suggests, we gradually add the column $X_j$ of $X$ whose angle with the current residual is smallest. More specifically, each iteration has two steps (sketched in code after the list):
- Find the least squares solution on the current index set $I_k$ and the residual:
  $$w_{LS}^{(k)} = (X_{I_k}^T X_{I_k})^{-1} X_{I_k}^T y, \qquad r^{(k)} = y - X_{I_k} w_{LS}^{(k)}$$
- Add to $I_k$ the column of $X$ that correlates the most with the residual: pick the $j$-th column of $X$ (call it $X_j$), where
  $$j = \underset{j'}{\arg\max} \; \frac{|X_{j'}^T r^{(k)}|}{\|X_{j'}\|_2 \, \|r^{(k)}\|_2}$$
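A minimal sketch of orthogonal matching pursuit following the two steps above (the function name and the stopping rule, a fixed number of selected columns `k`, are my own choices):

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal matching pursuit: greedily select k columns of X."""
    n, d = X.shape
    idx, r = [], y.copy()
    for _ in range(k):
        # Pick the column with the smallest angle to the current residual.
        scores = np.abs(X.T @ r) / (np.linalg.norm(X, axis=0) * np.linalg.norm(r))
        scores[idx] = -np.inf                   # never re-pick a selected column
        idx.append(int(np.argmax(scores)))
        # Refit least squares on the selected columns and update the residual.
        X_I = X[:, idx]
        w_I = np.linalg.solve(X_I.T @ X_I, X_I.T @ y)
        r = y - X_I @ w_I
    w = np.zeros(d)
    w[idx] = w_I
    return w
```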
Least Absolute Deviation
Model Definition
$$w_{LAD} = \underset{w}{\arg\min} \sum_{i=1}^{n} |y_i - f(x_i; w)|$$
There is no explicit solution for LAD, but we can find one using linear programming, as sketched below.
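Here is a minimal sketch of that LP formulation using `scipy.optimize.linprog` (the variable split $z = [w, t]$ with $t_i \ge |y_i - x_i^T w|$ is a standard trick, but the code and names are my own):

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """LAD via LP: minimize sum(t) subject to -t <= y - Xw <= t, variables z = [w, t]."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])         # objective: sum of t
    A_ub = np.block([[-X, -np.eye(n)],                    #  y - Xw <= t
                     [ X, -np.eye(n)]])                   # -(y - Xw) <= t
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * d + [(0, None)] * n         # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]
```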
Probabilistic Interpretation
If we assume the error has a Laplace distribution, which is
$$f(y \mid \mu, b) = \frac{1}{2b} e^{-\frac{|y - \mu|}{b}}$$
Then its ML solution would be LAD.