About LDA

date: 2019-04-25
tags: machine-learning  math  

Probabilistic Topic Model

  • Learns distributions on words called “topics” shared by documents
  • Learns a distribution on topics for each document
  • Assigns every word in a document to a topic

LDA

There are two essential ingredients to latent Dirichlet allocation (LDA):

  1. A collection of distributions on words (one distribution over words for each topic).
  2. A distribution on topics for each document.

The generative process for LDA is:

  1. Generate each topic, which is a distribution on words

    $$\beta_k\sim Dirichlet(\gamma),\quad k=1, ..., K$$

    Notice that $\beta_k$ is a $V$-dimensional vector, where $V$ is the size of the vocabulary, and $\gamma$ is also a $V$-dimensional parameter.

  2. For each document, generate a distribution on topics

    $$\theta_d\sim Dirichlet(\alpha),\quad d=1, ..., D$$

  3. For the $n$th word in the $d$th document:

    (a) Allocate the word to a topic, $c_{dn}\sim Discrete(\theta_d)$

    (b) Generate the word from the selected topic, $x_{dn}\sim Discrete(\beta_{c_{dn}})$

In this way, we generate all documents.
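
To make this concrete, here is a minimal Python (NumPy) sketch of the generative process. The sizes $K$, $V$, $D$ and the document lengths are made up for illustration; this only simulates the forward model, not the inference.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 3, 10, 5          # number of topics, vocabulary size, number of documents (made up)
gamma = np.full(V, 0.1)     # topic-word Dirichlet parameter (assumed symmetric here)
alpha = np.full(K, 0.5)     # document-topic Dirichlet parameter (assumed symmetric here)
doc_lengths = [8, 12, 6, 10, 9]

# 1. Generate each topic beta_k, a distribution over the V words.
beta = rng.dirichlet(gamma, size=K)          # shape (K, V)

documents = []
for d in range(D):
    # 2. Generate a distribution over topics for document d.
    theta_d = rng.dirichlet(alpha)           # shape (K,)
    words = []
    for n in range(doc_lengths[d]):
        # 3(a). Allocate the word to a topic.
        c_dn = rng.choice(K, p=theta_d)
        # 3(b). Generate the word from the selected topic.
        x_dn = rng.choice(V, p=beta[c_dn])
        words.append(x_dn)
    documents.append(words)

print(documents)
```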

Before we talk about how to infer the parameters, we need to understand what this model means.

Unigram Model

First, let's put the topic model aside and look at the difference between the frequentist and the Bayesian views.

For a unigram model, we assume the vocabulary size is $V$. A document is $d=\vec{w}=(w_1, w_2, ..., w_n)$, and the word frequencies are $\vec{n}=(n_1, n_2, ..., n_V)$.

Frequentist

From the frequentist point of view, there is only one die with $V$ faces. And the probability of the words in the corpus is:

$$p(\vec{n})=Mult(\vec{n}|\vec{p}, N)=\left(\begin{array}{c} N\\ \vec{n} \end{array}\right) \prod_{k=1}^Vp_k^{n_k}$$

And using MLE to maximize the probability, we have

$$\hat{p}_i=\frac{n_i}{N}$$
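
For example, a tiny sketch with made-up counts:

```python
import numpy as np

# Made-up word counts for a vocabulary of V = 4 words.
n = np.array([3, 1, 4, 2])
N = n.sum()

p_hat = n / N   # MLE: p_hat_i = n_i / N
print(p_hat)    # [0.3 0.1 0.4 0.2]
```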

Bayesian

The Bayesian thinks the die has a prior. And since the Dirichlet distribution is the conjugate prior of the multinomial distribution, the prior is chosen to be a Dirichlet distribution.

$$Dir(\vec{p}|\vec{\alpha})=\frac{1}{\Delta(\vec{\alpha})}\prod_{k=1}^Vp_k^{\alpha_k-1}$$

And the posterior is

$$p(\vec{p}|W, \vec{\alpha})=Dir(\vec{p}|\vec{n}+\vec{\alpha})=\frac{1}{\Delta(\vec{n}+\vec{\alpha})}\prod_{k=1}^Vp_k^{n_k+\alpha_k-1}$$

And we could either maximize the posterior or use its expectation. If we use the expectation,

$$\hat{p}_i=\frac{n_i+\alpha_i}{\sum_{i=1}^V(n_i+\alpha_i)}$$
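
Continuing the same made-up counts, a sketch of the posterior update and its expectation:

```python
import numpy as np

n = np.array([3, 1, 4, 2])                 # same made-up counts as above
alpha = np.array([0.5, 0.5, 0.5, 0.5])     # assumed symmetric Dirichlet prior

# The posterior is Dirichlet(n + alpha); its expectation gives the estimate.
p_hat = (n + alpha) / (n + alpha).sum()
print(p_hat)    # approximately [0.292 0.125 0.375 0.208]
```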

Topic Model

Now let's take the topic into account. The corpus consists of $D$ documents, $C=(d_1, d_2, ..., d_D)$.

This time, we have two kinds of dice. The first kind has $K$ faces, corresponding to the $K$ topics, and the second kind has $V$ faces, corresponding to the $V$ words.

Frequentist (PLSA)

For the frequentist, we have $D$ dice of the first kind, one for each document, named $\vec{\theta}_1, ..., \vec{\theta}_D$, and $K$ of the second kind, named $\beta_1, ..., \beta_K$. We first roll a die of the first kind to decide the topic, and then roll that topic's die to decide the word.

In this way, for a given word $w$ in document $m$ we have:

$$p(w|d_m)=\sum_{k=1}^Kp(w|k)p(k|d_m)=\sum_{k=1}^K\beta_{kw}\theta_{mk}$$

And the probability of a document is:

$$
\begin{aligned}
p(\vec{w}|d_m)&=\prod_{i=1}^{n_m}\sum_{k=1}^Kp(w_i|k)p(k|d_m)\\
&=\prod_{i=1}^{n_m}\sum_{k=1}^K\beta_{kw_i}\theta_{mk}
\end{aligned}
$$

This is similar to GMM and can be solved with EM.
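
As a rough illustration (not the original derivation), here is a minimal sketch of that EM procedure on a made-up document-term count matrix; the array name `ndw` and the random initialization are my assumptions, not something from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up document-term count matrix: D documents, V words, positive counts.
ndw = rng.integers(1, 5, size=(4, 6)).astype(float)
D, V = ndw.shape
K = 2   # number of topics (made up)

# Random initialization of p(k|d) and p(w|k).
theta = rng.dirichlet(np.ones(K), size=D)   # (D, K), rows are p(k|d)
beta = rng.dirichlet(np.ones(V), size=K)    # (K, V), rows are p(w|k)

for _ in range(100):
    # E-step: responsibility q(k|d,w) proportional to theta[d,k] * beta[k,w].
    q = theta[:, :, None] * beta[None, :, :]        # (D, K, V)
    q /= q.sum(axis=1, keepdims=True) + 1e-12

    # M-step: re-estimate theta and beta from the expected counts n_dw * q(k|d,w).
    expected = ndw[:, None, :] * q                  # (D, K, V)
    theta = expected.sum(axis=2)
    theta /= theta.sum(axis=1, keepdims=True)
    beta = expected.sum(axis=0)
    beta /= beta.sum(axis=1, keepdims=True)

print(theta)   # p(k|d) for each document
print(beta)    # p(w|k) for each topic
```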

Bayesian (LDA)

For the Bayesian, again, there is a Dirichlet prior for each die. And the difference between LDA and PLSA is the same as the difference between the Bayesian and frequentist views in the unigram model.

Notice a property of the Dirichlet distribution: when its parameter is small, it tends to generate sparse vectors. This encodes the assumption that only a few words are relevant to each topic and that each document is relevant to only a few topics, which is an important reason why LDA works.
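
A quick way to see this sparsity effect (the parameter values are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small parameter -> samples concentrate their mass on a few components (sparse).
print(rng.dirichlet(np.full(10, 0.1)).round(3))

# Large parameter -> samples are close to uniform (dense).
print(rng.dirichlet(np.full(10, 10.0)).round(3))
```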

Inference

It is common to use Gibbs sampling rather than EM to solve LDA. We will talk about it in more detail some time later.
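
Since the derivation is deferred, here is only a minimal sketch of the collapsed Gibbs update with symmetric priors, assuming documents are given as lists of word ids; this is one common formulation, not necessarily the exact one a later post would derive.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapsed_gibbs(docs, K, V, alpha=0.5, gamma=0.1, iters=200):
    """docs: list of lists of word ids in [0, V). Returns topic assignments z."""
    D = len(docs)
    ndk = np.zeros((D, K))              # topic counts per document
    nkw = np.zeros((K, V))              # word counts per topic
    nk = np.zeros(K)                    # total word counts per topic
    z = [rng.integers(0, K, size=len(doc)) for doc in docs]

    # Initialize the counts from the random assignments.
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove this token's current assignment from the counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(k | rest) proportional to (n_dk + alpha) * (n_kw + gamma) / (n_k + V * gamma)
                p = (ndk[d] + alpha) * (nkw[:, w] + gamma) / (nk + V * gamma)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and add it back to the counts.
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z

# Toy usage with made-up word ids:
print(collapsed_gibbs([[0, 1, 2, 1], [3, 4, 3, 5]], K=2, V=6, iters=50))
```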
