
Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via the Scala API) like so:

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)

// The default optimizer is EM, which produces a DistributedLDAModel
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]

And now I want to report the log-likelihood and perplexity of the model.

I can get the log-likelihood like so:

scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716

But this is where things get weird. I also want the perplexity, which is only implemented for a local model, so I run:

val localModel = distLDAModel.toLocal

Which lets me get the (log) perplexity like so:

scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674

But the local model also supports the log-likelihood calculation, which I run like this:

scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148

So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for the distributed model says:

"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"

while for a local model it says:

"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."

I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?

To summarize, two main questions:

1 - How and why are the two log-likelihood values different, and which should I use?

2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)


1 Answer


1) These two log-likelihood values differ because they are computed for two different models. DistributedLDAModel effectively computes the log-likelihood w.r.t. a model where the topic parameters and the per-document mixing weights are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while LocalLDAModel takes the view that both the topic parameters and the per-document mixing weights are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood, which is what makes the variational approximation/lower bound necessary. Even without the approximation the two log-likelihoods would not be the same, since the models are simply different.
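To make this concrete, here is a minimal sketch, using the distLDAModel from the question, of the two pieces of the EM-trained model's objective (logLikelihood and logPrior are both methods on DistributedLDAModel):

// The corpus log-likelihood treats the learned topics and per-document
// mixing weights as fixed point estimates.
val tokenLL = distLDAModel.logLikelihood

// logPrior is the log density of those point estimates under the Dirichlet
// priors, i.e. the "regularization" term in the regularized-PLSI view above.
val priorLL = distLDAModel.logPrior

// The full regularized objective is their sum.
val emObjective = tokenLL + priorLL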

As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e., the DistributedLDAModel). As a side note, the primary (only?) reason I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more on this, see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
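A minimal sketch of that use case follows; newDocs is a hypothetical RDD[(Long, Vector)] of document IDs and term-count vectors, which you would need to build with the same vocabulary as the training corpus:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical new documents, vectorized against the training vocabulary.
val newDocs: RDD[(Long, Vector)] = ??? // build these the same way as docsWithFeatures

// toLocal makes per-document topic inference available for unseen documents.
val topicDists: RDD[(Long, Vector)] = distLDAModel.toLocal.topicDistributions(newDocs)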

2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus, so yes, plain perplexity is the exponential of the logPerplexity result. If you divide the log-perplexity by math.log(2.0), the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
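As a sketch, using localModel and docsWithFeatures from the question (the token count is just the sum of all term counts in the corpus):

// Each feature vector holds term counts, so summing its entries and then
// summing over the corpus gives the total token count.
val corpusTokenCount = docsWithFeatures.map { case (_, vec) => vec.toArray.sum }.sum()

val logPerp = localModel.logPerplexity(docsWithFeatures) // ≈ -logLikelihood / corpusTokenCount
val perplexity = math.exp(logPerp)                       // plain perplexity
val bitsPerToken = logPerp / math.log(2.0)               // approx. bits per token under the model

Note that the numbers in the question are consistent with this: 3672913.268... / 0.36729132... ≈ 10^7, suggesting a corpus of roughly ten million tokens.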