I am trying to optimize an LDA topic model estimated with collapsed Gibbs sampling. I have been using the ldatuning package in R to choose the number of topics k:
controls_tm <- list(
  burnin = 1000,
  iter = 4000,
  thin = 500,
  nstart = 5,
  seed = 0:4,
  best = TRUE
)
num_cores <- max(parallel::detectCores() - 1, 1)
result <- FindTopicsNumber(
  my_dfm,
  topics = seq(40, 100, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  mc.cores = num_cores,
  control = controls_tm,
  verbose = TRUE
)
This is all fine. I can now run topicmodels in R for a given k with the same controls, but it takes ~8 hours per model, even on an HPC cluster with 27 cores. I am looking for a Python implementation of LDA that I can run with the same controls, so that it is consistent with what I used in ldatuning but faster, because I need to run multiple models to compare perplexity.
I have looked at the lda library in Python, which uses Gibbs sampling and takes <1 hour per model, but as far as I can tell I cannot pass it burnin or thin parameters.
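For reference, this is essentially the whole chain interface I could find in lda; X below is just a placeholder count matrix so the sketch runs:

import lda
import numpy as np

X = np.random.randint(0, 5, size=(200, 1000))  # placeholder document-term count matrix

# n_iter and random_state are the only chain controls I can find;
# there is no burnin or thin argument
model = lda.LDA(n_topics=60, n_iter=4000, random_state=0)
model.fit(X)
print(model.loglikelihood())  # log likelihood of the final Gibbs state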
I have also looked at sklearn.decomposition.LatentDirichletAllocation, but it uses variational Bayes instead of Gibbs sampling, and it does not appear to accept burnin or thin either. The same seems to be true of gensim (I think; I am not very familiar with it).
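For comparison, this is the sklearn interface I was looking at; it is variational rather than Gibbs, although it does expose perplexity directly, which is ultimately what I need to compare (again, X is a placeholder count matrix):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

X = np.random.randint(0, 5, size=(200, 1000))  # placeholder document-term count matrix

lda_vb = LatentDirichletAllocation(n_components=60, max_iter=100,
                                   learning_method="batch", random_state=0)
lda_vb.fit(X)
print(lda_vb.perplexity(X))  # perplexity on the training matrix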
Does this just not exist in Python? Or is there a workaround so that I can run a model in Python with Gibbs sampling and the parameters I want?
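One workaround I have been sketching uses tomotopy, another Python library that implements collapsed Gibbs sampling and lets you call train() repeatedly. Emulating burnin and thin by training in chunks, and keeping the highest-likelihood sample as a rough stand-in for best = TRUE in topicmodels, would look something like this (tokenized_docs is a placeholder for the tokenized corpus, and k = 60 is arbitrary):

import tomotopy as tp

burnin, iters, thin = 1000, 4000, 500

mdl = tp.LDAModel(k=60, seed=0)
for doc in tokenized_docs:  # tokenized_docs: list of lists of tokens (placeholder)
    mdl.add_doc(doc)

mdl.train(burnin)  # run and discard the first `burnin` Gibbs sweeps

best_ll = float("-inf")
for _ in range(iters // thin):
    mdl.train(thin)  # advance the chain another `thin` sweeps
    if mdl.ll_per_word > best_ll:  # keep the highest-likelihood thinned sample,
        best_ll = mdl.ll_per_word  # roughly what best = TRUE does in topicmodels
        mdl.save("best_model.bin")

best = tp.LDAModel.load("best_model.bin")
print(best.perplexity)

Repeating this over seeds 0:4 (as in the R controls) and keeping the overall best state would presumably stand in for nstart = 5. Is something like this reasonable, or is there a library that takes these controls directly? Thanks!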
method="Gibss"
in Python? – Henry Navarro