Matrix Factorization is a simple way to do collaborative filtering.
$$M_{N_1 \times N_2} = U_{N_1 \times d} V_{d \times N_2}$$
The main idea is that $d \ll \min(N_1, N_2)$ and $d$ is smaller than the rank of $M$. In other words, it is equivalent to learning a low-rank matrix (of rank at most $d$) that approximates $M$.
The reason we can learn such a low-rank matrix is that:
- We assume many columns of $M$ look similar
- $M$ is sparse, and we can use the low-rank approximation to fill in the missing entries
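As a quick illustration (a minimal NumPy sketch with arbitrary example dimensions), the product of an $N_1 \times d$ matrix and a $d \times N_2$ matrix always has rank at most $d$:

```python
import numpy as np

# Arbitrary example dimensions: N1 users, N2 objects, small latent dimension d.
N1, N2, d = 100, 80, 5
rng = np.random.default_rng(0)

U = rng.normal(size=(N1, d))   # one d-dimensional vector u_i per user (rows)
V = rng.normal(size=(d, N2))   # one d-dimensional vector v_j per object (columns)

M_approx = U @ V               # N1 x N2 matrix with rank at most d
print(np.linalg.matrix_rank(M_approx))  # prints 5
```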
Probabilistic Matrix Factorization
Let $\Omega$ be the set of pairs $(i,j)$ for which $M_{ij}$ is observed (non-zero). Let $\Omega_{u_i}$ denote the objects rated by user $i$, and $\Omega_{v_j}$ the users who rated object $j$.
Assume
$$u_i \sim N(0, \lambda^{-1} I), \quad i = 1, \ldots, N_1$$
$$v_j \sim N(0, \lambda^{-1} I), \quad j = 1, \ldots, N_2$$
And
$$M_{ij} \sim N(u_i^T v_j, \sigma^2)$$
Strictly speaking, a normal distribution does not make sense for $M$ as ratings (which are discrete and bounded), but in practice this assumption works well.
And we need to maximize:
$$p(M^o, U, V) = \left[ \prod_{(i,j)\in\Omega} p(M_{ij} \mid u_i, v_j) \right] \times \left[ \prod_{i=1}^{N_1} p(u_i) \right] \left[ \prod_{j=1}^{N_2} p(v_j) \right]$$
where $M^o$ denotes the observed entries of $M$.
Notice that under more general assumptions, this is a missing-data problem that we would solve iteratively with EM. However, applying the Gaussian assumptions to this posterior, the log joint likelihood becomes:
$$\mathcal{L} = -\sum_{(i,j)\in\Omega} \frac{1}{2\sigma^2} \|M_{ij} - u_i^T v_j\|^2 - \sum_{i=1}^{N_1} \frac{\lambda}{2} \|u_i\|^2 - \sum_{j=1}^{N_2} \frac{\lambda}{2} \|v_j\|^2 + \text{constant}$$
And we can take derivatives directly and set them to zero:
$$\nabla_{u_i} \mathcal{L} = \sum_{j\in\Omega_{u_i}} \frac{1}{\sigma^2} (M_{ij} - u_i^T v_j) v_j - \lambda u_i = 0$$
$$\nabla_{v_j} \mathcal{L} = \sum_{i\in\Omega_{v_j}} \frac{1}{\sigma^2} (M_{ij} - u_i^T v_j) u_i - \lambda v_j = 0$$
Therefore
$$u_i = \left( \lambda\sigma^2 I + \sum_{j\in\Omega_{u_i}} v_j v_j^T \right)^{-1} \left( \sum_{j\in\Omega_{u_i}} M_{ij} v_j \right)$$
$$v_j = \left( \lambda\sigma^2 I + \sum_{i\in\Omega_{v_j}} u_i u_i^T \right)^{-1} \left( \sum_{i\in\Omega_{v_j}} M_{ij} u_i \right)$$
Since each $u_i$ update depends on $V$ and each $v_j$ update depends on $U$, we need a coordinate ascent algorithm that iterates these updates until convergence.
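Below is a minimal NumPy sketch of this coordinate ascent (function and variable names are my own; the observed ratings are assumed to be given as a dict from $(i,j)$ to $M_{ij}$, and the rows of `U` and `V` play the roles of $u_i$ and $v_j$):

```python
import numpy as np

def pmf_coordinate_ascent(M_obs, N1, N2, d=10, lam=1.0, sigma2=0.25,
                          n_iters=50, seed=0):
    """Coordinate ascent for probabilistic matrix factorization.

    M_obs : dict mapping (i, j) -> observed rating M_ij (the set Omega).
    Returns U (N1 x d) and V (N2 x d) such that M_ij ~ u_i^T v_j.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=1.0 / np.sqrt(lam), size=(N1, d))
    V = rng.normal(scale=1.0 / np.sqrt(lam), size=(N2, d))

    # Index the observed entries per user (Omega_{u_i}) and per object (Omega_{v_j}).
    omega_u = {i: [] for i in range(N1)}
    omega_v = {j: [] for j in range(N2)}
    for (i, j), m in M_obs.items():
        omega_u[i].append((j, m))
        omega_v[j].append((i, m))

    for _ in range(n_iters):
        # Update each u_i given the current V (a ridge-regression-style solve).
        for i in range(N1):
            if not omega_u[i]:
                continue
            A = lam * sigma2 * np.eye(d)
            b = np.zeros(d)
            for j, m in omega_u[i]:
                A += np.outer(V[j], V[j])
                b += m * V[j]
            U[i] = np.linalg.solve(A, b)
        # Update each v_j given the current U.
        for j in range(N2):
            if not omega_v[j]:
                continue
            A = lam * sigma2 * np.eye(d)
            b = np.zeros(d)
            for i, m in omega_v[j]:
                A += np.outer(U[i], U[i])
                b += m * U[i]
            V[j] = np.linalg.solve(A, b)
    return U, V
```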
Relation with ridge regression
Looking at the objective function from a single $v_j$'s point of view (with $U$ fixed), the update is basically a ridge regression. So the model is a set of $N_1 + N_2$ coupled ridge regression problems.
If we remove the prior term, the update becomes
$$v_j = \left( \sum_{i\in\Omega_{v_j}} u_i u_i^T \right)^{-1} \left( \sum_{i\in\Omega_{v_j}} M_{ij} u_i \right)$$
This is the least squares solution (no longer MAP). But in this case, to ensure invertibility, each object must be rated by at least $d$ users, which is probably not the case in practice.
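A small sketch of this correspondence (all names and numbers here are illustrative): with $U$ fixed, stacking the $u_i$ of the users who rated object $j$ as rows of a design matrix turns the $v_j$ update into a ridge regression solve, and dropping the prior gives ordinary least squares, which requires the Gram matrix to be invertible:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_users, n_raters = 5, 100, 20
U = rng.normal(size=(n_users, d))                         # current user factors (fixed)
idx = rng.choice(n_users, size=n_raters, replace=False)   # users who rated object j
m = rng.normal(size=n_raters)                             # their ratings M_ij
lam, sigma2 = 1.0, 0.25

X = U[idx]  # design matrix: one row u_i per observed rating of object j
# MAP update for v_j = ridge regression of m on X with penalty lam * sigma2
v_map = np.linalg.solve(lam * sigma2 * np.eye(d) + X.T @ X, X.T @ m)
# Without the prior: plain least squares; needs X.T @ X invertible,
# i.e. at least d linearly independent raters for this object.
v_ls = np.linalg.solve(X.T @ X, X.T @ m)
```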
Nonnegative Matrix Factorization
Sometimes we need $U$ and $V$ to be nonnegative. For example, let $M_{ij}$ be the number of times word $i$ appears in document $j$. Writing the data matrix as $X \approx WH$ with $W, H \ge 0$, we have two objective functions:
Choice 1: Squared error objective
$$\|X - WH\|^2 = \sum_i \sum_j \left( X_{ij} - (WH)_{ij} \right)^2$$
Choice 2: Divergence objective
$$D(X \,\|\, WH) = -\sum_i \sum_j \left[ X_{ij} \ln (WH)_{ij} - (WH)_{ij} \right]$$
Notice that, apart from minimizing the objective function, the entries of $W$ and $H$ must remain nonnegative.
For these problems, we use multiplicative algorithms to minimize the objective function. The detailed algorithms are in *Algorithms for Non-negative Matrix Factorization* (Lee and Seung).
And for squared error we have:
$$H_{kj} \leftarrow H_{kj} \frac{(W^T X)_{kj}}{(W^T W H)_{kj}}, \qquad W_{ik} \leftarrow W_{ik} \frac{(X H^T)_{ik}}{(W H H^T)_{ik}}$$
as an iterative algorithm.
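A minimal NumPy sketch of these multiplicative updates (the function name, the random initialization, and the small `eps` added for numerical safety are my own choices):

```python
import numpy as np

def nmf_squared_error(X, d=10, n_iters=200, eps=1e-10, seed=0):
    """Multiplicative updates for NMF under the squared error objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, d)) + eps   # nonnegative initialization
    H = rng.random((d, m)) + eps
    for _ in range(n_iters):
        # H_kj <- H_kj * (W^T X)_kj / (W^T W H)_kj
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        # W_ik <- W_ik * (X H^T)_ik / (W H H^T)_ik
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```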
Probabilistically, the squared error implies a Gaussian distribution
$$X_{ij} \sim N\!\left( \sum_k W_{ik} H_{kj}, \sigma^2 \right)$$
This is incorrect for nonnegative values, but it still works well in practice.
And for the divergence objective we have:
$$H_{kj} \leftarrow H_{kj} \frac{\sum_i W_{ik} X_{ij} / (WH)_{ij}}{\sum_i W_{ik}}, \qquad W_{ik} \leftarrow W_{ik} \frac{\sum_j H_{kj} X_{ij} / (WH)_{ij}}{\sum_j H_{kj}}$$
as an iterative algorithm.
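And a corresponding sketch for the divergence updates (again, the names and the `eps` safeguard are my own; the matrix products implement the sums over $i$ and $j$):

```python
import numpy as np

def nmf_divergence(X, d=10, n_iters=200, eps=1e-10, seed=0):
    """Multiplicative updates for NMF under the divergence objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, d)) + eps
    H = rng.random((d, m)) + eps
    for _ in range(n_iters):
        # H_kj <- H_kj * [sum_i W_ik X_ij / (WH)_ij] / [sum_i W_ik]
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
        # W_ik <- W_ik * [sum_j H_kj X_ij / (WH)_ij] / [sum_j H_kj]
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```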
The probabilistic interpretation is
$$X_{ij} \sim \text{Pois}\!\left( (WH)_{ij} \right)$$
Therefore,
$$\ln P(X_{ij} \mid W, H) = \ln \frac{(WH)_{ij}^{X_{ij}} e^{-(WH)_{ij}}}{X_{ij}!} = X_{ij} \ln (WH)_{ij} - (WH)_{ij} + \text{constant}$$
Hence, minimizing the divergence objective function is equivalent to maximizing this likelihood.