0
votes

I am using an LDA model on a corpus to learn the topics covered in it. I am using the gensim package (e.g., gensim.models.ldamodel.LdaModel), but I can easily use another LDA implementation if necessary.

My question is: what is the most efficient way to use the fitted model and/or its topic words or topic IDs to find and retrieve new documents that contain a given topic?

Concretely, I want to scrape a media API to find new articles (out-of-sample documents) that relate to the topics contained in my original corpus. Because this is a 'blind search', running LDA inference on every new document may be too cumbersome; most new documents will not contain the topic.

I could of course simply retrieve new documents that contain one to n of the most frequent words of the LDA-learned topics, and then apply LDA to the returned documents for further confidence.
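
For example, I could pull the top-n words of a topic directly from the fitted model and use them as search keywords (a rough sketch; lda is my trained gensim LdaModel, and the helper name is just for illustration):

def topic_keywords(lda, topic_id, n=10):
    # show_topic returns (word, probability) pairs for the given topic
    return [word for word, _ in lda.show_topic(topic_id, topn=n)]

keywords = topic_keywords(lda, topic_id=3, n=10)
# e.g. pass `keywords` as search terms to the media API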

I am wondering if there is a more sophisticated method that gives better confidence that the new out-of-sample articles actually contain the same topic, as opposed to coincidentally containing one or two of the topic words.

I am looking at TopicTiling algorithms but am not sure if they are applicable here.


1 Answer

1
votes

I do not think you can search in the topic space without transforming everything into the topic space. One could argue for learning functions that approximate similarity in the topic space without performing the transformation (for instance with neural networks), but I think that is beyond the scope of the question.

Since the above is not really helpful on its own, there are several methods one can think of that will generate candidates better than simple keyword matching; I will describe a couple of them.

Use the topics as documents

The topics are simply distributions over the words, so you could treat them as documents and compute the cosine similarity between a topic and a test document to get an estimate of how strongly that topic is present in the document.
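
As a rough sketch (assuming lda and dictionary are your trained gensim LdaModel and its Dictionary; the helper name is mine):

import numpy as np
from gensim.matutils import sparse2full

def topic_document_similarity(lda, dictionary, tokens, topic_id):
    # the topic as a dense probability distribution over the vocabulary
    topic_vec = lda.get_topics()[topic_id]
    # the new document as a dense term-frequency vector over the same vocabulary
    doc_vec = sparse2full(dictionary.doc2bow(tokens), len(dictionary))
    # cosine similarity between the topic and the document
    denom = np.linalg.norm(topic_vec) * np.linalg.norm(doc_vec)
    return float(np.dot(topic_vec, doc_vec) / denom) if denom else 0.0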

Use example documents

You could use k documents from the training set as examples for each topic and compute the similarity of those documents with a test document to get an estimate of the topic's presence in the document.
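
Again as a rough sketch (train_bows is the training corpus in bag-of-words form; picking the exemplars by each document's probability for the topic is just one possible choice):

from gensim.similarities import MatrixSimilarity

def build_exemplar_index(lda, dictionary, train_bows, topic_id, k=5):
    # score every training document by the probability of the target topic
    scored = []
    for bow in train_bows:
        dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        scored.append((dist.get(topic_id, 0.0), bow))
    # keep the k documents where the topic is most prominent
    scored.sort(key=lambda pair: pair[0], reverse=True)
    exemplars = [bow for _, bow in scored[:k]]
    return MatrixSimilarity(exemplars, num_features=len(dictionary))

def exemplar_similarity(index, dictionary, tokens):
    # cosine similarity of the test document to each exemplar; take the best
    sims = index[dictionary.doc2bow(tokens)]
    return float(max(sims))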

Use similarity hashing

With both of the above techniques you could also use locality-sensitive hashing, for instance simhash, to generate candidates from large corpora more efficiently.

To make my last point clearer, I would use the following pipeline (in pseudo-Python):

# t is the topic of interest (its word distribution is hashed as if it were a document)
ht = simhash(t) # few bits here
candidates = []
final_texts = []
# cheap pass: keep only the texts whose fingerprint matches the topic's
for text in new_texts:
    if simhash(text) == ht:
        candidates.append(text)
# expensive pass: run LDA inference only on the surviving candidates
for text in candidates:
    topic_distribution = lda.infer(text)
    if argmax(topic_distribution) == t:
        final_texts.append(text)
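
If it helps, a minimal simhash and the matching distance function could look like this (a rough sketch, hashing plain token lists with md5 so the fingerprints are stable across runs; with few bits an exact match works, with more bits you would compare Hamming distance instead):

import hashlib

def simhash(tokens, bits=64):
    # accumulate +1/-1 per bit position over all token hashes
    acc = [0] * bits
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    # the sign of each accumulator gives one bit of the fingerprint
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    # number of differing bits between two fingerprints
    return bin(a ^ b).count("1")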