You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.
But you want the reverse: a list of documents, for a topic.
Essentially, you'll want to build the reverse-mapping yourself.
Fortunately, Python's native dicts and idioms for working with mappings make this pretty simple, just a few lines of code, as long as you're working with data that fully fits in memory.
Very roughly the approach would be:
- create a new structure (`dict` or `list`) for mapping topics to lists-of-documents
- iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
- finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
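
As a minimal sketch of those three steps, independent of any particular library: the `doc_topics_by_id` data below is made up for illustration, standing in for whatever per-document topic output your model gives you, and the IDs and scores are assumptions rather than anything from your setup.

```python
from collections import defaultdict

# Hypothetical stand-in for per-document topic output:
# doc_id -> list of (topic_id, score) pairs.
doc_topics_by_id = {
    "doc_a": [(0, 0.9), (2, 0.1)],
    "doc_b": [(0, 0.4), (1, 0.6)],
    "doc_c": [(1, 0.7), (2, 0.3)],
}

# Step 1: a new structure mapping each topic to a list of documents.
docs_per_topic = defaultdict(list)

# Step 2: iterate over all docs, adding each one (with its score).
for doc_id, doc_topics in doc_topics_by_id.items():
    for topic_id, score in doc_topics:
        docs_per_topic[topic_id].append((doc_id, score))

# Step 3: look up the docs for one topic, sorted by descending score.
top_for_topic_0 = sorted(docs_per_topic[0], key=lambda pair: pair[1], reverse=True)
print(top_for_topic_0)
```

A `defaultdict(list)` is convenient here because topic IDs don't need to be known in advance; a plain list indexed by topic ID works just as well when the topics are numbered 0 to N-1.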
If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.
Update for your code update:
OK, if your model is in `ldamodel` and your BOW-formatted docs are in `corpus`, you'd do something like:

    # setup: create an empty list per topic to collect the docs
    # (num_topics covers every topic; print_topics() returns only 20 by default)
    docs_per_topic = [[] for _ in range(ldamodel.num_topics)]
    # now, for every doc...
    for doc_id, doc_bow in enumerate(corpus):
        # ...get its topics...
        doc_topics = ldamodel.get_document_topics(doc_bow)
        # ...& for each of its topics...
        for topic_id, score in doc_topics:
            # ...add the doc_id & its score to the topic's doc list
            docs_per_topic[topic_id].append((doc_id, score))

After this, you can see the list of all `(doc_id, score)` values for a certain topic like this (for topic 0):

    print(docs_per_topic[0])

If you're interested in the top docs per topic, you can further sort each list's pairs by their score:

    for doc_list in docs_per_topic:
        doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)

Then, you could get the top-10 docs for topic 0 like:

    print(docs_per_topic[0][:10])

Note that this does everything using all-in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
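
As one rough illustration of the database option, here is a sketch using Python's built-in `sqlite3` module. The table name `topic_docs`, the index name, the `top_docs` helper, and the sample rows are all hypothetical, chosen only for this example; in practice the rows would be produced one document at a time while iterating over the corpus.

```python
import sqlite3

# Hypothetical (topic_id, doc_id, score) rows for illustration only.
rows = [
    (0, 0, 0.9), (2, 0, 0.1),
    (0, 1, 0.4), (1, 1, 0.6),
    (1, 2, 0.7), (2, 2, 0.3),
]

# ":memory:" keeps this sketch self-contained; pass a file path instead
# to get an actual disk-backed database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topic_docs (topic_id INTEGER, doc_id INTEGER, score REAL)"
)
conn.executemany("INSERT INTO topic_docs VALUES (?, ?, ?)", rows)
# An index on (topic_id, score) helps per-topic, score-ordered lookups.
conn.execute("CREATE INDEX idx_topic_score ON topic_docs (topic_id, score)")
conn.commit()

def top_docs(conn, topic_id, limit=10):
    """Return up to `limit` (doc_id, score) pairs for a topic, best first."""
    cursor = conn.execute(
        "SELECT doc_id, score FROM topic_docs "
        "WHERE topic_id = ? ORDER BY score DESC LIMIT ?",
        (topic_id, limit),
    )
    return cursor.fetchall()

print(top_docs(conn, 0))
```

With this layout the sorting and top-N selection happen in the query, so the full per-topic lists never have to sit in memory at once.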