date: 2019-04-25
tags: machine-learning math
There are two essential ingredients to latent Dirichlet allocation (LDA): the per-topic distributions over words and the per-document distributions over topics.
The generative process for LDA is:
1. Generate each topic $\beta_k \sim \mathrm{Dir}(\eta)$, $k = 1, \dots, K$, which is a distribution on words.

   Notice that $\beta_k$ is a $V$-dimensional vector, where $V$ is the size of the vocabulary, and $\eta$ is also a $V$-dimensional parameter.

2. For each document $d$, generate its topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is a $K$-dimensional parameter.

3. For the $n$-th word in the $d$-th document:

   (a) Allocate the word to a topic, $z_{d,n} \sim \mathrm{Mult}(\theta_d)$.

   (b) Generate the word from the selected topic, $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$.
In this way, we generate all documents.
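As a concrete illustration, here is a minimal NumPy sketch of this generative process; the function name `generate_corpus` and the toy parameter values below are made up for this post:

```python
import numpy as np

def generate_corpus(K, V, M, doc_len, alpha, eta, seed=0):
    """Simulate the LDA generative process above for a toy corpus."""
    rng = np.random.default_rng(seed)
    # 1. Each topic beta_k is a distribution over the V vocabulary words.
    beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)
    docs = []
    for _ in range(M):
        # 2. Each document gets its own topic proportions theta_d.
        theta = rng.dirichlet(np.full(K, alpha))       # shape (K,)
        words = []
        for _ in range(doc_len):
            z = rng.choice(K, p=theta)                 # (a) allocate the word to a topic
            w = rng.choice(V, p=beta[z])               # (b) generate the word from that topic
            words.append(w)
        docs.append(words)
    return beta, docs

# A toy corpus: 3 topics, a 20-word vocabulary, 5 documents of 50 words each.
beta, docs = generate_corpus(K=3, V=20, M=5, doc_len=50, alpha=0.1, eta=0.01)
```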
Before we talk about how to infer the parameters, we need to understand what this model means.

First, let's put the topic model aside and look at the difference between the frequentist and the Bayesian views.
For a unigram model, we assume the vocabulary size is $V$, a document is $\mathbf{w} = (w_1, w_2, \dots, w_N)$, and the word frequencies are $\mathbf{n} = (n_1, n_2, \dots, n_V)$.

From the frequentist point of view, there is only one die, with $V$ faces whose probabilities are $\mathbf{p} = (p_1, p_2, \dots, p_V)$. The probability of the words in the corpus is:

$$p(\mathbf{w} \mid \mathbf{p}) = \prod_{v=1}^{V} p_v^{n_v}.$$

Using MLE to maximize this probability, we have

$$\hat{p}_v = \frac{n_v}{N}.$$
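For a tiny concrete example (the sentence below is made up), the MLE is just the normalized word counts $n_v / N$:

```python
from collections import Counter

doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)                    # n_v for each word v
N = len(doc)                             # total number of words N
mle = {word: n / N for word, n in counts.items()}
print(mle)  # {'the': 0.375, 'cat': 0.25, 'sat': 0.125, 'on': 0.125, 'mat': 0.125}
```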
The Bayesian thinks the die has a prior, and since the Dirichlet distribution is the conjugate prior of the multinomial distribution, the prior is chosen to be a Dirichlet distribution, $\mathbf{p} \sim \mathrm{Dir}(\boldsymbol{\alpha})$.

The posterior is then

$$p(\mathbf{p} \mid \mathbf{w}) = \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha} + \mathbf{n}).$$

We could maximize the posterior or use its expectation. If we use the expectation,

$$\hat{p}_v = \mathbb{E}[p_v \mid \mathbf{w}] = \frac{n_v + \alpha_v}{\sum_{v'=1}^{V} (n_{v'} + \alpha_{v'})}.$$
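As a small numeric sketch (the counts and the symmetric prior value 0.5 below are made up), the posterior expectation simply adds the pseudo-counts $\alpha_v$ to the observed counts:

```python
import numpy as np

counts = np.array([3.0, 2.0, 1.0, 1.0, 1.0])   # n_v for a toy 5-word vocabulary
alpha = np.full(5, 0.5)                         # symmetric Dirichlet prior
mle = counts / counts.sum()
posterior_mean = (counts + alpha) / (counts.sum() + alpha.sum())
print(mle)             # [0.375 0.25  0.125 0.125 0.125]
print(posterior_mean)  # approx. [0.333 0.238 0.143 0.143 0.143]; rare words get pulled up by the prior
```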
Now let's take the topics into account. The corpus consists of $M$ documents, $\mathcal{D} = \{d_1, d_2, \dots, d_M\}$.

This time, we have two kinds of dice. The first kind has $K$ faces, corresponding to the topics, and the second kind has $V$ faces, corresponding to the words.

For the frequentist, we have $M$ dice of the first kind, one for each document, named $\theta_1, \dots, \theta_M$, and $K$ dice of the second kind, one for each topic, named $\beta_1, \dots, \beta_K$. We first roll a die of the first kind to decide the topic and then roll the corresponding die of the second kind to decide the word.
In this way, for a given word $w$ in document $d_m$ we have:

$$p(w \mid d_m) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d_m) = \sum_{k=1}^{K} \beta_{k, w}\, \theta_{m, k}.$$

And the probability of a document is:

$$p(d_m) = \prod_{n=1}^{N_m} \sum_{k=1}^{K} \beta_{k, w_{m,n}}\, \theta_{m, k}.$$
This model is exactly pLSA. It is similar to a Gaussian mixture model (GMM) and can be solved with EM.
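As a rough illustration, here is a minimal NumPy sketch of that EM procedure; the function name `plsa_em` and its arguments are made up for this post, and the updates follow the standard pLSA E- and M-steps:

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    """A sketch of EM for pLSA on a document-word count matrix.

    counts: (M, V) array where counts[d, w] is how often word w occurs in document d.
    Returns theta (M, K), the p(topic | document) dice, and beta (K, V), the p(word | topic) dice.
    """
    rng = np.random.default_rng(seed)
    M, V = counts.shape
    theta = rng.dirichlet(np.ones(K), size=M)   # p(z = k | d), one die per document
    beta = rng.dirichlet(np.ones(V), size=K)    # p(w | z = k), one die per topic

    for _ in range(n_iters):
        # E-step: responsibility of topic k for word w in document d,
        # q(z = k | d, w) proportional to theta[d, k] * beta[k, w]
        q = theta[:, :, None] * beta[None, :, :]          # shape (M, K, V)
        q /= q.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate both kinds of dice from the expected counts
        expected = counts[:, None, :] * q                 # shape (M, K, V)
        theta = expected.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True)
        beta = expected.sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)

    return theta, beta
```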
For the Bayesian, again, there is a Dirichlet prior on each die, and the difference between LDA and pLSA is then the same as the difference between the Bayesian and the frequentist in the unigram model.
Notice a property of the Dirichlet distribution: when its parameter is small, it tends to generate sparse vectors. This encodes the assumption that only a few words are relevant to each topic and each document is about only a few topics, which is an important reason why LDA works.
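A quick way to see this is to sample from a Dirichlet with a small versus a large symmetric parameter (the exact numbers depend on the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Small concentration: most of the mass lands on one or two components (sparse).
print(rng.dirichlet(np.full(10, 0.1)).round(2))
# Large concentration: all components stay close to the uniform value 0.1.
print(rng.dirichlet(np.full(10, 10.0)).round(2))
```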
It is common to use Gibbs sampling rather than EM to fit LDA. We will talk about that some time later.