3
votes

I would like to do LDA topic modeling on a 9GB corpus. The plan is to train an LDA model using MALLET for 1000 iterations with 100 topics, optimizing hyperparameters every 10 iterations after a 200-iteration burn-in period. I am working on 64-bit Windows 8; the computer has 16GB RAM and an Intel® Core™ i7-4720HQ processor. Can anyone tell me how long I should expect this to take? Are we talking about hours or days? This is the first question I am asking here, so if I've skipped some important info, please let me know.
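For concreteness, the setup described above maps onto MALLET's `train-topics` options roughly as in the sketch below. It only builds and prints the command line, it does not run anything. The paths `bin/mallet` and `corpus.mallet` are placeholders assumed for illustration, not details from the post.

```python
# Sketch only: builds (does not run) a MALLET train-topics command line.
# "bin/mallet" and "corpus.mallet" are placeholder paths, not from the post.

def build_train_command(mallet_bin="bin/mallet", input_file="corpus.mallet"):
    """Return the argument list for the training run described in the question."""
    return [
        mallet_bin, "train-topics",
        "--input", input_file,           # previously imported MALLET instance file
        "--num-topics", "100",           # 100 topics
        "--num-iterations", "1000",      # 1000 Gibbs sampling iterations
        "--optimize-interval", "10",     # re-estimate hyperparameters every 10 iterations
        "--optimize-burn-in", "200",     # start optimizing after 200 iterations
    ]

command = build_train_command()
print(" ".join(command))
```

The list form can be passed to `subprocess.run` as-is if you want to launch the run from a script.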


2 Answers

5
votes

So, just in case anyone is interested: in the end I ran the topic modeling as detailed in the question, and it took almost two days to finish (1 day 20 hours).
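To put that figure in per-iteration terms, here is the arithmetic as a small sketch. It assumes the 1 day 20 hours covers only the 1000 training iterations and gives an average; individual iterations will not all take the same time.

```python
from datetime import timedelta

# Reported wall-clock time for the full run (from the post above).
total = timedelta(days=1, hours=20)
iterations = 1000  # number of iterations planned in the question

# Average cost of one iteration, assuming the whole time was spent sampling.
seconds_per_iteration = total.total_seconds() / iterations

print(f"{total.total_seconds() / 3600:.0f} hours total, "
      f"{seconds_per_iteration:.1f} s per iteration on average")
```

That works out to 44 hours in total, or about 158 seconds per iteration on this machine and corpus.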

3
votes

The exact time will vary with the complexity of the corpus. Sampling will start to go faster as the model begins to fit better, since uncertainty will go down. I would guess it will take on the order of a day to get a good model.

Importing data may be the most challenging part. The "bulkload" command is designed to reduce memory footprint for imports that consist of a large file with one document per line. This command will also do vocabulary pruning based on word frequency.
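To illustrate what frequency-based vocabulary pruning means for one-document-per-line input, here is a minimal sketch. It is not MALLET's implementation; the function names and the `min_count` threshold are made up for this example.

```python
from collections import Counter

def count_words(lines):
    """Count token occurrences, reading one document (line) at a time."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def prune_vocabulary(counts, min_count):
    """Keep only the words that occur at least min_count times in the corpus."""
    return {word for word, n in counts.items() if n >= min_count}

# Toy corpus; an open file handle works the same way, since it also yields lines.
docs = ["topic models find topics", "models need data", "data data data"]
counts = count_words(docs)
vocab = prune_vocabulary(counts, min_count=2)
print(sorted(vocab))  # ['data', 'models']
```

Because the counting pass touches one line at a time, only the word counts need to stay in memory, not the documents themselves.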

For a corpus of this size with hyperparameter optimization, consider using more topics. Training with 500 topics will probably take no longer than with 100, for the same reason: sampling goes faster as the model fits better.