Topic models: cross validation with loglikelihood or perplexity

Question

I'm clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60.

I have divided my corpus into ten batches and set aside one batch for a holdout set. I have ran latent dirichlet allocation (LDA) using nine batches (total 180 documents) with topics 10 to 60. Now, I have to calculate perplexity or log likelihood for the holdout set.

I found this code from one of CV's discussion sessions. I really don't understand several lines of codes below. I have dtm matrix using the holdout set (20 documents). But I don't know how to calculate the perplexity or log likelihood of this holdout set.

Questions:

Can anybody explain to me what seq(2, 100, by =1) mean here? Also, what AssociatedPress[21:30] mean? What function(k) is doing here?
```
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
```
If I want to calculate perplexity or log likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions but since I'm new I can not figure out how to implement it with my holdout matrix, called dtm.
How can I do ten fold cross validation with my corpus, containing 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but again cannot figure that out either.

What in the documentation for ?seq and ?AssociatedPress and the other functions did you not understand? — probabilityislogic
I updated the code for this and saved as a gist. has plot method that prints by default. devtools::source_url("https://gist.githubusercontent.com/trinker/594bd132b180a43945f7/raw/70fbb1aa2a9113837a9a9f8a6c43d884c2ef5bd0/optimal_k%25202") +1 nice answer. — Tyler Rinker

Peter Ellis Peter Ellis · Accepted Answer · 2017-01-04T21:47:42

The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.

Using perplexity for simple validation

Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to fit with the perplexity function, which takes as arguments a previously fit topic model and a new set of data, and returns a single number. The lower the better.

For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):

# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)

data("AssociatedPress", package = "topicmodels")

burnin = 1000
iter = 1000
keep = 50

full_data  <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5

splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]

fitted <- LDA(train_set, k = k, method = "Gibbs",
                          control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300

The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.

Using perplexity and cross-validation to determine a good number of topics

The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.

You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.

The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:

#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)

# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
   library(topicmodels)
})

folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics

# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))

# we parallelize by the different number of topics.  A processor is allocated a value
# of k, and does the cross-validation serially.  This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise
system.time({
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar%{
   k <- candidate_k[j]
   results_1k <- matrix(0, nrow = folds, ncol = 2)
   colnames(results_1k) <- c("k", "perplexity")
   for(i in 1:folds){
      train_set <- full_data[splitfolds != i , ]
      valid_set <- full_data[splitfolds == i, ]

      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep) )
      results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
   }
   return(results_1k)
}
})
stopCluster(cluster)

results_df <- as.data.frame(results)

ggplot(results_df, aes(x = k, y = perplexity)) +
   geom_point() +
   geom_smooth(se = FALSE) +
   ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
           "(ie five different models fit for each candidate number of topics)") +
   labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")

We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.

Topic models: cross validation with loglikelihood or perplexity

2 Answers

Using perplexity for simple validation

Using perplexity and cross-validation to determine a good number of topics