About Matrix Factorization

date: 2019-04-24
tags: machine-learning  math  

Matrix Factorization is a simple way to do collaborative filtering.


$$M_{N_1\times N_2}=U_{N_1\times d}V_{d\times N_2}$$

The main idea is that $d \ll \min(N_1, N_2)$ and is smaller than the rank of $M$. Actually, it is equivalent to learning a low-rank matrix (with rank at most $d$) that approximates $M$.

The reason why we can learn such a low rank matrix is that:

  • We assume many columns in $M$ should look similar.
  • $M$ is sparse, and we can use the low-rank matrix to fill in the missing data.
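
As a minimal NumPy illustration with made-up sizes, storing $U$ and $V$ takes $N_1 d + d N_2$ numbers instead of $N_1 N_2$, and their product has rank at most $d$:

```python
import numpy as np

# Hypothetical sizes: N1 users, N2 objects, latent dimension d << min(N1, N2)
N1, N2, d = 1000, 500, 10

rng = np.random.default_rng(0)
U = rng.normal(size=(N1, d))   # one length-d vector per user
V = rng.normal(size=(d, N2))   # one length-d vector per object

M_hat = U @ V                  # rank <= d approximation of M
print(M_hat.shape)             # (1000, 500)
print(np.linalg.matrix_rank(M_hat) <= d)  # True
```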

Probabilistic Matrix Factorization

Let $\Omega$ contain the pairs $(i, j)$ for which $M_{ij}$ is observed (non-zero), let $\Omega_{u_i}$ be the set of objects rated by user $i$, and let $\Omega_{v_j}$ be the set of users who rated object $j$.

Assume

$$\begin{aligned} u_i&\sim N(0, \lambda^{-1} I), \quad i=1,\dots,N_1\\ v_j&\sim N(0, \lambda^{-1} I), \quad j=1,\dots,N_2 \end{aligned}$$

And

$$M_{ij}\sim N(u_i^T v_j, \sigma^2)$$

Though it does not really make sense to use a normal distribution when $M$ contains ratings, this assumption works well in practice.

And we need to maximize:

$$p(M_o, U, V)=\Big[\prod_{(i,j)\in \Omega}p(M_{ij}\mid u_i, v_j)\Big]\times \Big[\prod_{i=1}^{N_1}p(u_i)\Big]\Big[\prod_{j=1}^{N_2}p(v_j)\Big]$$

where $M_o$ denotes the observed entries of $M$.

Notice that under a general assumption, this is a problem with missing values and should be solved iteratively with EM. However, taking the log of this posterior under the Gaussian assumptions above, we have:

$$\mathcal{L}=-\sum_{(i, j)\in\Omega}\frac{1}{2\sigma^2}\|M_{ij}-u_i^Tv_j\|^2 -\sum_{i=1}^{N_1}\frac{\lambda}{2}\|u_i\|^2-\sum_{j=1}^{N_2}\frac{\lambda}{2}\|v_j\|^2 +\text{constant}$$
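
As a small sketch (the function name and dense representation are my own; zeros of $M$ mark the missing entries, matching $\Omega$), this log posterior, up to its constant, can be computed directly, e.g. to monitor the coordinate ascent further below:

```python
import numpy as np

def log_posterior(M, U, V, lam=1.0, sigma2=1.0):
    """L up to the additive constant. U is N1 x d, V is N2 x d (row j is v_j)."""
    observed = M != 0                      # the index set Omega
    residual = (M - U @ V.T)[observed]     # M_ij - u_i^T v_j on observed entries
    return (-np.sum(residual ** 2) / (2 * sigma2)
            - lam / 2 * (np.sum(U ** 2) + np.sum(V ** 2)))
```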

And we can take the derivatives directly and set them to zero,

$$\begin{aligned} \nabla_{u_i}\mathcal{L}&=\sum_{j\in\Omega_{u_i}}\frac{1}{\sigma^2}(M_{ij}-u_i^Tv_j)v_j-\lambda u_i=0\\ \nabla_{v_j}\mathcal{L}&=\sum_{i\in\Omega_{v_j}}\frac{1}{\sigma^2}(M_{ij}-u_i^Tv_j)u_i-\lambda v_j=0 \end{aligned}$$

Therefore

$$\begin{aligned} u_i&=\Big(\lambda\sigma^2I+\sum_{j\in\Omega_{u_i}}v_jv_j^T\Big)^{-1}\Big(\sum_{j\in\Omega_{u_i}}M_{ij}v_j\Big)\\ v_j&=\Big(\lambda\sigma^2I+\sum_{i\in\Omega_{v_j}}u_iu_i^T\Big)^{-1}\Big(\sum_{i\in\Omega_{v_j}}M_{ij}u_i\Big) \end{aligned}$$

And since each update depends on the other set of factors, we need a coordinate ascent algorithm that alternates between the two updates to iteratively reach the optimum.
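
Here is a minimal NumPy sketch of that coordinate ascent (the function name, the dense matrix with zeros marking missing entries, and the default hyperparameters are my own choices; $V$ is stored row-wise, so row $j$ is $v_j$):

```python
import numpy as np

def pmf_coordinate_ascent(M, d=10, lam=1.0, sigma2=1.0, n_iters=50, seed=0):
    """Alternate the closed-form updates for the u_i and the v_j."""
    N1, N2 = M.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(N1, d))
    V = rng.normal(scale=0.1, size=(N2, d))
    observed = M != 0

    for _ in range(n_iters):
        for i in range(N1):
            idx = observed[i]              # Omega_{u_i}: objects rated by user i
            if idx.any():
                Vi = V[idx]
                A = lam * sigma2 * np.eye(d) + Vi.T @ Vi
                U[i] = np.linalg.solve(A, Vi.T @ M[i, idx])
        for j in range(N2):
            idx = observed[:, j]           # Omega_{v_j}: users who rated object j
            if idx.any():
                Uj = U[idx]
                A = lam * sigma2 * np.eye(d) + Uj.T @ Uj
                V[j] = np.linalg.solve(A, Uj.T @ M[idx, j])
    return U, V
```

Because each block update is exact, alternating them never decreases $\mathcal{L}$.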

Relation with ridge regression

From the objective function, if we look at it from a single $v_j$'s point of view (with $U$ fixed), it is basically a ridge regression.

So the model is a set of $N_1+N_2$ coupled ridge regression problems.
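
Concretely, holding $U$ fixed, each $v_j$ minimizes

$$\sum_{i\in\Omega_{v_j}}\frac{1}{2\sigma^2}\big(M_{ij}-u_i^Tv_j\big)^2+\frac{\lambda}{2}\|v_j\|^2,$$

which is a ridge regression with the $u_i$ as inputs and the $M_{ij}$ as targets (and symmetrically for each $u_i$ with $V$ fixed).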

If we remove the prior term, the solution becomes

$$v_j=\Big(\sum_{i\in\Omega_{v_j}}u_iu_i^T\Big)^{-1}\Big(\sum_{i\in\Omega_{v_j}}M_{ij}u_i\Big)$$

This is the least squares solution (not MAP any more). But in this case, to ensure invertibility, each object must be rated by at least $d$ users, which is probably not the case.

Nonnegative Matrix Factorization

Sometimes we need $U$ and $V$ to be nonnegative (written $W$ and $H$ below, with $X\approx WH$). For example, let $X_{ij}$ be the number of times word $i$ appears in document $j$. We have two choices of objective function:

Choice 1: Squared error objective

$$\|X-WH\|^2=\sum_i\sum_j\big(X_{ij}-(WH)_{ij}\big)^2$$

Choice 2: Divergence objective

$$D(X\|WH)=-\sum_i\sum_j\big[X_{ij}\ln(WH)_{ij}-(WH)_{ij}\big]$$

Notice that apart from the choice of objective function, the entries of $W$ and $H$ are constrained to be nonnegative.
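
As a quick sketch, both objectives are a few lines of NumPy (the small `eps` guarding against $\ln 0$ is my own addition):

```python
import numpy as np

def squared_error(X, W, H):
    # ||X - WH||^2 = sum_ij (X_ij - (WH)_ij)^2
    return np.sum((X - W @ H) ** 2)

def divergence(X, W, H, eps=1e-12):
    # D(X||WH) = -sum_ij [X_ij * ln(WH)_ij - (WH)_ij]
    WH = W @ H
    return -np.sum(X * np.log(WH + eps) - WH)
```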

And for these problems, we will use multiplicative algorithms to minimize the objective function. The detailed algorithms are given in Algorithms for Non-negative Matrix Factorization (Lee & Seung).

And for squared error we have:

$$\begin{aligned} H_{kj}&\leftarrow H_{kj}\frac{(W^TX)_{kj}}{(W^TWH)_{kj}}\\ W_{ik}&\leftarrow W_{ik}\frac{(XH^T)_{ik}}{(WHH^T)_{ik}} \end{aligned}$$

as an iterative algorithm.
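
A minimal sketch of these updates (the random nonnegative initialization and the small `eps` in the denominators are my own additions to avoid division by zero):

```python
import numpy as np

def nmf_squared_error(X, d=10, n_iters=200, eps=1e-12, seed=0):
    """Multiplicative updates for min ||X - WH||^2 with W, H >= 0."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(n, d))
    H = rng.uniform(0.1, 1.0, size=(d, m))
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H_{kj}
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W_{ik}
    return W, H
```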

Probabilistically, the squared error implies a Gaussian distribution

$$X_{ij}\sim N\Big(\sum_k W_{ik}H_{kj},\ \sigma^2\Big)$$

It is not strictly correct for nonnegative values, but still, it works well.

And for the divergence objective we have:

$$\begin{aligned} H_{kj}&\leftarrow H_{kj}\frac{\sum_i W_{ik}X_{ij}/(WH)_{ij}}{\sum_i W_{ik}}\\ W_{ik}&\leftarrow W_{ik}\frac{\sum_j H_{kj}X_{ij}/(WH)_{ij}}{\sum_j H_{kj}} \end{aligned}$$

as an iterative algorithm.
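
And the matching sketch for the divergence updates (same caveats as above), where the denominators $\sum_i W_{ik}$ and $\sum_j H_{kj}$ become broadcast sums:

```python
import numpy as np

def nmf_divergence(X, d=10, n_iters=200, eps=1e-12, seed=0):
    """Multiplicative updates for min D(X || WH) with W, H >= 0."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(n, d))
    H = rng.uniform(0.1, 1.0, size=(d, m))
    for _ in range(n_iters):
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```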

The probabilistic interpretation is

$$X_{ij}\sim \text{Pois}\big((WH)_{ij}\big)$$

Therefore,

$$\begin{aligned} \ln P(X_{ij}\mid W, H)&=\ln\Big(\frac{(WH)_{ij}^{X_{ij}}}{X_{ij}!}e^{-(WH)_{ij}}\Big)\\ &=X_{ij}\ln(WH)_{ij}-(WH)_{ij} + \text{constant} \end{aligned}$$

Hence, minimizing the divergence objective function is equivalent to maximizing this likelihood.