
Gensim's HDP model for topic modeling (gensim.models.hdpmodel.HdpModel) has a constructor that takes an argument called max_chunks.

The documentation says max_chunks is the number of chunks the model will process, and that if it is larger than the number of chunks in the supplied corpus, training will wrap around the corpus.

Since the INFO logs warned me that the likelihood has been decreasing, I figure I may need multiple passes over the corpus to converge.

The LDA model provides the passes argument for training on the corpus over multiple iterations, but I have difficulty figuring out how max_chunks in HDP maps to passes in LDA.

For example, say my corpus has 1,000,000 documents. What exactly does max_chunks need to be in order to train, say, 3 passes over my corpus?

Any suggestions? Many thanks.


2 Answers


The chunksize, passes, and update_every options can be a bit confusing. What helped me was this link, specifically the section "Chunksize, Passes, and Update_every".

So in your case: if you are doing batch LDA, with update_every set to 0 and chunksize set to the number of documents, then setting passes to 3 should give you three passes over the complete corpus.
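
A minimal sketch of that batch setup, assuming a toy corpus built on the spot (the documents and num_topics here are placeholders, not anything from the question):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Tiny placeholder corpus, just to make the sketch runnable.
    docs = [
        ["human", "interface", "computer"],
        ["survey", "user", "computer", "system"],
        ["graph", "trees", "minors"],
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Batch LDA: update_every=0 selects batch learning, and with chunksize
    # equal to the corpus size each pass processes the corpus as one chunk.
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        update_every=0,
        chunksize=len(corpus),
        passes=3,
    )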

In the case of online LDA, where update_every is set to 1, you can additionally use chunksize to control the size of the mini-batches within each pass.
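
And a matching online sketch, reusing corpus and dictionary from the snippet above; the chunksize of 1 is just an illustrative mini-batch size:

    # Online LDA: update_every=1 updates the model after every mini-batch,
    # and chunksize controls how many documents each mini-batch contains.
    lda_online = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        update_every=1,
        chunksize=1,
        passes=3,
    )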


    class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None, random_state=None)

I think if you have 1,000,000 documents and use the default chunksize of 256, you'll need max_chunks = ceil(1000000 / 256) * 3 = 3907 * 3 = 11721 (chunks per pass, rounded up, times the number of passes) to force 3 passes.
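
A hedged sketch of that arithmetic; the toy corpus below is a placeholder, while the numbers assume the question's 1,000,000-document corpus and HdpModel's default chunksize:

    import math

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # Placeholder stand-in corpus so the sketch runs; the arithmetic assumes
    # the question's hypothetical 1,000,000-document corpus instead.
    docs = [["human", "interface", "computer"], ["graph", "trees", "minors"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    num_docs = 1_000_000   # corpus size from the question
    chunksize = 256        # HdpModel's default
    passes = 3

    # Chunks per pass, rounded up, times the number of passes desired.
    max_chunks = math.ceil(num_docs / chunksize) * passes  # 3907 * 3 = 11721

    # Training wraps around the corpus until max_chunks chunks are consumed.
    hdp = HdpModel(
        corpus=corpus,
        id2word=dictionary,
        chunksize=chunksize,
        max_chunks=max_chunks,
    )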

I'm also getting the "WARNING: likelihood is decreasing!" message, and I think my corpus is simply too small (608 short texts) and too uniform for topics to be found in it.