
Gensim's HDP model for topic modeling (gensim.models.hdpmodel.HdpModel) has a constructor that takes an argument called max_chunks.

The documentation says max_chunks is the number of chunks the model will process, and that if it is larger than the number of chunks in the supplied corpus, training will wrap around the corpus.

Since the INFO logs warned me that the likelihood has been decreasing, I figure I may need multiple passes over the corpus to converge.

The LDA model provides the passes argument for training on the corpus over multiple iterations, but I have difficulty figuring out how max_chunks in HDP maps to passes in LDA.

For example, say my corpus has 1,000,000 documents. What exactly does max_chunks need to be in order to train, say, 3 passes over my corpus?

Any suggestions? Many thanks.


2 Answers


The chunksize, passes, and update_every options can be a bit confusing. What helped me was this link, specifically the section "Chunksize, Passes, and Update_every".

So in your case: if you are doing batch LDA, with update_every set to 0 and chunksize set to the number of documents, then setting passes to 3 should give you three passes over the complete corpus.
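
A minimal sketch of that batch setup, assuming a toy corpus built on the spot (the documents and num_topics here are placeholders, not anything from the question):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Tiny placeholder corpus, just to make the sketch runnable.
    docs = [
        ["human", "interface", "computer"],
        ["survey", "user", "computer", "system"],
        ["graph", "trees", "minors"],
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Batch LDA: update_every=0 selects batch learning, and with chunksize
    # equal to the corpus size each pass processes the corpus as one chunk.
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        update_every=0,
        chunksize=len(corpus),
        passes=3,
    )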

In the case of online LDA, where update_every is set to 1, you can additionally use chunksize to control the size of the mini-batches within each pass.
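
And a matching online sketch, reusing corpus and dictionary from the snippet above; the chunksize of 1 is just an illustrative mini-batch size:

    # Online LDA: update_every=1 updates the model after every mini-batch,
    # and chunksize controls how many documents each mini-batch contains.
    lda_online = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        update_every=1,
        chunksize=1,
        passes=3,
    )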


    class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None, random_state=None)

I think if you have 1,000,000 documents and use the default chunksize of 256, you'll need max_chunks = ceil(1000000 / 256) * 3 = 3907 * 3 = 11721 (chunks per pass, rounded up, times the number of passes) to force 3 passes.
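
A hedged sketch of that arithmetic; the toy corpus below is a placeholder, while the numbers assume the question's 1,000,000-document corpus and HdpModel's default chunksize:

    import math

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # Placeholder stand-in corpus so the sketch runs; the arithmetic assumes
    # the question's hypothetical 1,000,000-document corpus instead.
    docs = [["human", "interface", "computer"], ["graph", "trees", "minors"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    num_docs = 1_000_000   # corpus size from the question
    chunksize = 256        # HdpModel's default
    passes = 3

    # Chunks per pass, rounded up, times the number of passes desired.
    max_chunks = math.ceil(num_docs / chunksize) * passes  # 3907 * 3 = 11721

    # Training wraps around the corpus until max_chunks chunks are consumed.
    hdp = HdpModel(
        corpus=corpus,
        id2word=dictionary,
        chunksize=chunksize,
        max_chunks=max_chunks,
    )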

I'm also getting the "WARNING: likelihood is decreasing!" message, and I think my corpus is simply too small (608 short texts) and too uniform for topics to be found in it.