I have a question about topic models like pLSA/LDA: how do we infer the topic distribution of a new document once we have the per-topic word distributions? I have tried "fold-in" Gibbs sampling with LDA, but when the unseen document is very short this doesn't work well, because of the random assignment of a topic to each word in the document. For example, consider a model with two topics and a token w with p(w|z1) = 0.09 and p(w|z2) = 0.01. For a document that contains only the single word w, the estimated p(z|d) will be (1.0, 0) most of the time and (0, 1.0) occasionally, because sometimes the sampling procedure assigns w to topic 2. How can we deal with this situation?
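To make this concrete, here is a minimal sketch of the fold-in sampling I have in mind (the phi values, alpha, and the fold_in_gibbs helper are illustrative assumptions, not my actual code); reading the estimate off a single run for the one-word document collapses onto one topic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) quantities: two topics, word w has id 0.
phi = np.array([[0.09, 0.91],    # p(word | z1) for a two-word vocabulary
                [0.01, 0.99]])   # p(word | z2)
alpha = 0.1                      # symmetric Dirichlet prior on theta
doc = [0]                        # unseen document: a single occurrence of w
K = phi.shape[0]

def fold_in_gibbs(doc, n_iter=100):
    """Resample topic assignments for `doc` with phi held fixed (fold-in)."""
    z = rng.integers(K, size=len(doc))        # random initial topic assignments
    counts = np.bincount(z, minlength=K)      # per-topic token counts in this doc
    for _ in range(n_iter):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1                 # take token i out of its topic
            p = phi[:, w] * (counts + alpha)  # p(z_i = k | z_-i, w) up to a constant
            p /= p.sum()
            z[i] = rng.choice(K, p=p)         # draw a new topic for token i
            counts[z[i]] += 1
    return counts

# A single run yields a hard (1, 0) or (0, 1) estimate for this document:
counts = fold_in_gibbs(doc)
print(counts / counts.sum())
```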
1 Answer
I am not sure what you mean by "randomness": after Gibbs sampling has converged, the topic assignments should not be arbitrary, they should make some sense. Maybe you ran fewer sampling iterations than necessary?
In addition, since you have only two topics, the probabilities should sum to 1: normalising p(w|z1) = 0.09 and p(w|z2) = 0.01 gives 0.9 and 0.1 for z1 and z2 respectively. It then seems logical that 90% of the time this word will be assigned to z1 and 10% of the time to z2. A document containing only w is an extreme case, but I believe the above still holds.
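A rough sketch of what I mean (the numbers are the ones from your question, everything else is an assumption for illustration): if you average the sampled assignment over many draws instead of reading off a single sample, the estimate settles near that 0.9 / 0.1 split rather than a hard (1, 0).

```python
import numpy as np

rng = np.random.default_rng(1)

# For a one-word document the collapsed conditional reduces to
# p(z = k | w) proportional to p(w | z = k) * alpha, i.e. 0.9 / 0.1 after normalising.
p_w_given_z = np.array([0.09, 0.01])
alpha = 0.1
p = p_w_given_z * alpha
p /= p.sum()

# One sample is always a hard assignment; the average over many samples
# recovers the (0.9, 0.1) topic distribution instead of (1, 0) or (0, 1).
samples = rng.choice(2, size=10_000, p=p)
print(np.bincount(samples, minlength=2) / samples.size)   # approximately [0.9, 0.1]
```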
I don't understand your problem completely, but there are also other ways to do approximate inference for LDA, for instance variational algorithms. That might help you do the inference for a new instance.
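For example, with a variational implementation such as gensim's LdaModel (the toy corpus below is made up just to have a fitted model to query), inferring the topic mixture of an unseen document, even a one-word one, returns a smooth posterior rather than a hard assignment from a single sample:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Made-up training documents, only so that there is a fitted model to query.
texts = [["apple", "banana", "fruit"],
         ["cpu", "gpu", "memory"],
         ["fruit", "banana", "smoothie"],
         ["memory", "cache", "cpu"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)

# Variational inference for a new, very short document.
new_doc = dictionary.doc2bow(["banana"])
print(lda.get_document_topics(new_doc, minimum_probability=0.0))
```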