I am using an LDA model on a corpus to learn the topics covered in it. I am using the gensim package (e.g., gensim.models.ldamodel.LdaModel), but I can easily switch to another LDA implementation if necessary.
My question: what is the most efficient way to use the trained model, and/or its topic words or topic IDs, to find and retrieve new documents that contain a given topic?
Concretely, I want to scrape a media API to find new articles (out-of-sample documents) that relate to the topics in my original corpus. Because this is a 'blind search', running LDA inference on every new document may be too cumbersome; most new documents will not contain the topic.
I can of course simply retrieve new documents that contain between one and n of the most frequent words of the LDA-learned topics, and then apply LDA to the returned documents for further confidence.
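To make that two-stage idea concrete, here is a minimal sketch of what I have in mind. It assumes an already trained gensim LdaModel (`lda`) and the gensim Dictionary used to train it (`dictionary`), and it relies only on `show_topic`, `doc2bow` and `get_document_topics`. The helper names, the crude tokenizer and the thresholds (`topn`, `min_distinct`, `min_prob`) are placeholders of mine, not anything gensim prescribes.

```python
import re
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Crude tokenizer; in practice reuse the preprocessing used for training."""
    return re.findall(r"[a-z]+", text.lower())


def prefilter_score(tokens: Iterable[str],
                    topic_terms: List[Tuple[str, float]]) -> Tuple[int, float]:
    """Return (count of distinct topic words present, summed weight of those words)."""
    present = set(tokens)
    hits = [(word, weight) for word, weight in topic_terms if word in present]
    return len(hits), sum(weight for _, weight in hits)


def passes_prefilter(tokens, topic_terms, min_distinct=3):
    """Cheap stage: require several distinct topic words, not just one or two."""
    distinct, _ = prefilter_score(tokens, topic_terms)
    return distinct >= min_distinct


def topic_probability(lda, dictionary, tokens, topic_id):
    """Expensive stage: infer the topic mixture of one unseen document."""
    bow = dictionary.doc2bow(tokens)  # tokens unknown to the dictionary are dropped
    doc_topics = lda.get_document_topics(bow, minimum_probability=0.0)
    return dict(doc_topics).get(topic_id, 0.0)


def find_matches(texts, lda, dictionary, topic_id,
                 topn=20, min_distinct=3, min_prob=0.3):
    """Run the cheap filter first; only survivors go through LDA inference."""
    topic_terms = lda.show_topic(topic_id, topn=topn)  # [(word, probability), ...]
    matches = []
    for text in texts:
        tokens = tokenize(text)
        if not passes_prefilter(tokens, topic_terms, min_distinct):
            continue
        prob = topic_probability(lda, dictionary, tokens, topic_id)
        if prob >= min_prob:
            matches.append((text, prob))
    return matches
```

Calling `find_matches(new_articles, lda, dictionary, topic_id=...)` would return the articles that survive both stages together with their inferred topic probability. The threshold values here are arbitrary and would need tuning against documents whose topic is known.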
I am wondering if there is a more sophisticated method that gives better confidence that the new out-of-sample articles actually contain the same topic, as opposed to coincidentally containing one or two of the topic words.
I am looking at Topic Tiling algorithms, but I am not sure whether they are applicable here.