2
votes

Given a standard LDA model with a few thousand topics and a few million documents, trained with MALLET / the collapsed Gibbs sampler:

When inferring a new document: why not just skip sampling and simply use the term-topic counts of the model to determine the topic assignments of the new document? I understand that applying Gibbs sampling to the new document takes into account the topic mixture of the new document, which in turn influences how the topics are composed (beta, the term-frequency distributions). However, as the topics are kept fixed when inferring a new document, I don't see why this should be relevant.

An issue with sampling is its probabilistic nature: the topic assignments inferred for a document sometimes vary greatly across repeated invocations. I would therefore like to understand the theoretical and practical value of sampling versus just using a deterministic method.
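To make the deterministic alternative concrete, here is roughly what I have in mind (a sketch only; `term_topic_counts` is just an illustrative name for the trained model's count matrix):

    import numpy as np

    def deterministic_assign(doc, term_topic_counts):
        """Assign each token in `doc` (a list of word ids) the topic with the
        highest count for that term in the trained model, ignoring the rest
        of the document. `term_topic_counts` is the K x V count matrix taken
        from the trained model (illustrative name)."""
        return [int(np.argmax(term_topic_counts[:, w])) for w in doc]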

Thanks Ben

Note that with MALLET you can make the probabilistic sampling deterministic by setting the random seed with the flag --random-seed. – jk - Reinstate Monica
Yes, that's true. My question is more about understanding the theoretical reasoning behind the sampling. When I set the seed, the results become deterministic, but given that they are sometimes quite different when the seed is not set, is that the correct approach? – Ben
At some point one usually commits to a certain topic model and starts interpreting it. In order to replicate it, you need to keep a record of the random seed. It is very instructive to compute several topic models with different random seeds: it immunises against the "this was computed by a computer and must be believed" fallacy. – jk - Reinstate Monica
First, thanks for your comments. I believe that for semantic consistency w.r.t. topic composition the seed should be irrelevant: maybe the counts are slightly different, but that should not affect the semantic consistency across random seeds. My issue, however, is semantic inconsistency in the inference phase. – Ben
It's tempting to think of inferred topic allocations in documents as "solutions", but really we should think of them as samples from the posterior distribution. Fixing the seed actually introduces a complex bias: you're just picking particular samples, some of which may be far from the mean/median. For example, for topic-word probabilities that are low, the posterior is quite spread out, and a particular sample cannot be taken as a "solution", as other samples could have quite different values. – drevicko

2 Answers

3
votes

Just using the term-topic counts of the last Gibbs sample is not a good idea. Such an approach doesn't take the document's topic structure into account: if a document has many words from one topic, it's likely to have even more words from that topic [1].

For example, say two words have equal probabilities under two topics. The topic assignment of the first word in a given document affects the topic probability of the second: the second word is more likely to be assigned the same topic as the first. The relation also works the other way around. The complexity of this interdependence is why we use methods like Gibbs sampling to estimate values for this sort of problem.
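Here is a minimal sketch of the resampling step for a single token during inference, assuming the trained topic-word probabilities `phi` are held fixed (illustrative code, not MALLET's implementation):

    import numpy as np

    rng = np.random.default_rng()

    def resample_token(w, n_dk, phi, alpha):
        """Resample the topic of one token with word id `w`.
        phi:   fixed K x V topic-word probabilities from the trained model
        n_dk:  topic counts of the *other* tokens currently assigned in this document
        alpha: document-topic Dirichlet prior
        The (n_dk + alpha) factor is what couples the tokens: topics already
        used elsewhere in the document become more likely for this token too."""
        p = (n_dk + alpha) * phi[:, w]
        return int(rng.choice(len(n_dk), p=p / p.sum()))

The deterministic rule from the question would instead pick `argmax(phi[:, w])` for every token, dropping the `n_dk` term and, with it, the interaction between the words.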

As for your comment on topic assignments varying, that can't be helped, and could even be taken as a good thing: if a word's topic assignment varies, you can't rely on it. What you're seeing is that the posterior distribution over topics for that word has no clear winner, so you should take any particular assignment with a grain of salt :)

[1] assuming the Dirichlet prior on document-topic distributions (usually denoted alpha) encourages sparsity, as is typically chosen for topic models.

3
votes

The real issue is computational complexity. If each of the N tokens in a document can be assigned any of K possible topics, there are K^N possible configurations of topics. With 1,000 topics and a document of just 100 tokens, that is 10^300 configurations, vastly more than the number of atoms in the observable universe (roughly 10^80).
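A quick back-of-the-envelope check of those numbers (purely illustrative values):

    K, N = 1000, 100        # topics, tokens in a short document
    configs = K ** N        # K^N possible topic configurations
    atoms = 10 ** 80        # rough estimate of atoms in the observable universe
    print(configs > atoms)  # True: exhaustive enumeration is hopeless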

Sampling from this search space is, however, quite efficient, and usually gives consistent results if you average over three to five consecutive Gibbs sweeps. You get to do something computationally impossible, and what it costs you is some uncertainty.
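Here is a rough sketch of that inference procedure, assuming fixed trained topics `phi` and averaging over the last few sweeps (illustrative only, not MALLET's implementation):

    import numpy as np

    rng = np.random.default_rng()   # left unseeded, so each run is a fresh sample

    def infer_doc_topics(doc, phi, alpha=0.1, n_sweeps=20, n_average=5):
        """Collapsed Gibbs sampling over one new document with the trained
        topic-word probabilities `phi` (K x V) held fixed, averaging the
        document-topic proportions over the last `n_average` sweeps."""
        K = phi.shape[0]
        z = rng.integers(K, size=len(doc))                 # random initial topics
        n_dk = np.bincount(z, minlength=K).astype(float)   # document-topic counts
        avg = np.zeros(K)
        for sweep in range(n_sweeps):
            for i, w in enumerate(doc):
                n_dk[z[i]] -= 1                            # take this token out
                p = (n_dk + alpha) * phi[:, w]             # doc counts x fixed topics
                z[i] = rng.choice(K, p=p / p.sum())        # resample its topic
                n_dk[z[i]] += 1                            # put it back
            if sweep >= n_sweeps - n_average:              # accumulate the last few sweeps
                avg += (n_dk + alpha) / (len(doc) + K * alpha)
        return avg / n_average

Averaging over the last few sweeps is what smooths out the run-to-run variation mentioned in the question; fixing the random seed would instead just freeze one particular sample.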

As was noted, you can get a "deterministic" result by setting a fixed random seed, but that doesn't actually solve anything.