
Let's assume that we have a moderately growing document corpus, i.e. some new documents get added to it every day. For these newly added documents, I can infer topic distributions using only the inference step of LDA; I do not have to re-run the whole topic estimation + inference process over all documents just to get topic distributions for the new ones. Over time, however, I might need to redo the full topic generation, because the documents added since the last LDA run may introduce entirely new words to the corpus.
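
For reference, here is a minimal sketch of that inference-only step using gensim; the file paths and the sample tokens are hypothetical placeholders, and the model/dictionary are assumed to come from the last full LDA run:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Load the dictionary and model produced by the last full LDA run
# (paths are placeholders for illustration).
dictionary = Dictionary.load("lda_dictionary.dict")
lda = LdaModel.load("lda_model.gensim")

def infer_topics(tokens):
    """Infer the topic distribution for a new, already-tokenized document
    without re-estimating the model."""
    bow = dictionary.doc2bow(tokens)  # words unseen at training time are silently dropped
    return lda.get_document_topics(bow, minimum_probability=0.0)

new_doc_tokens = ["topic", "model", "inference", "new", "document"]
print(infer_topics(new_doc_tokens))
```

Note that `doc2bow` drops any word that was not in the training vocabulary, which is exactly why a corpus with many genuinely new words eventually calls for a full re-run.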

Now, the question I have is: how do I determine a good enough interval between two topic generation runs? Are there any general recommendations on how often LDA should be executed over the whole document corpus?

If I keep the interval very short, the topic distributions will keep changing and never stabilize. If I keep it too long, I might miss new topics and new topic structures.


1 Answer


I'm just thinking aloud here... One very simple idea is to sample a subset of the newly added documents (say, the ones added over a period of one day).

You could extract keywords from each document in the sampled set and run each as a query against an index built from the version of the collection that existed before the new documents were added.

You could then measure the average cosine similarity between each query and the top K documents it retrieves, averaged over all queries in the sampled set. If this average similarity falls below a pre-defined threshold, it may indicate that the new documents are not very similar to the existing ones, in which case it would be a good idea to re-run LDA on the whole collection.
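
A rough sketch of that heuristic, using TF-IDF vectors in place of an actual search index and whole documents as queries rather than extracted keywords; the sample size, K, and threshold are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def needs_full_rerun(old_docs, new_docs, sample_size=50, top_k=10, threshold=0.2):
    """Return True if a sample of new documents looks dissimilar enough from
    the existing collection that a full LDA re-estimation seems warranted."""
    rng = np.random.default_rng(0)
    sample = list(rng.choice(new_docs, size=min(sample_size, len(new_docs)), replace=False))

    # Vectorize the *old* collection only, then project the sampled new docs into it,
    # mimicking queries against an index built before the new documents arrived.
    vectorizer = TfidfVectorizer(stop_words="english")
    old_vecs = vectorizer.fit_transform(old_docs)
    new_vecs = vectorizer.transform(sample)

    sims = cosine_similarity(new_vecs, old_vecs)     # (sampled docs) x (old docs) similarities
    top_k_sims = np.sort(sims, axis=1)[:, -top_k:]   # top-K most similar old docs per query
    avg_sim = top_k_sims.mean()                      # average over K and over all queries

    return avg_sim < threshold
```

The threshold would have to be tuned empirically, e.g. by checking what the average top-K similarity looks like for documents already known to fit the existing topics.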