
Is there a way in Python to list the documents belonging to a certain topic? For example, a list of documents that are primarily "Topic 0". I know there are ways to list the topics for each document, but how do I do it the other way around?

Edit:

I am using the following script for LDA:

    import os
    import gensim
    import textract
    from gensim import corpora
    from nltk.corpus import stopwords

    # my_path, files, tokenizer and p_stemmer are defined earlier (not shown)

    # extract the raw text of every file
    doc_set = []
    for file in files:
        path = os.path.join(my_path, file)
        doc_set.append(textract.process(path).decode("utf-8"))

    # lowercase, tokenize, drop stopwords and stem each document
    stop_words = set(stopwords.words())  # build once instead of once per token
    texts = []
    for doc in doc_set:
        tokens = tokenizer.tokenize(doc.lower())
        stopped_tokens = [t for t in tokens if t not in stop_words]
        texts.append([p_stemmer.stem(t) for t in stopped_tokens])

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary, passes=1)

1 Answer


You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.

But you want the reverse: a list of documents, for a topic.

Essentially, you'll want to build the reverse-mapping yourself.

Fortunately, Python's native dicts & idioms for working with mappings make this pretty simple (just a few lines of code) as long as you're working with data that fully fits in memory.

Very roughly the approach would be:

  • create a new structure (dict or list) for mapping topics to lists-of-documents
  • iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
  • finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
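
Those three steps can be sketched without any LDA library at all. In this minimal example, doc_topics is made-up data standing in for whatever per-document topic scores your model returns, and the function name invert_topic_mapping is purely illustrative:

```python
from collections import defaultdict

# Hypothetical input: for each doc id, a list of (topic_id, score) pairs.
doc_topics = {
    "doc_a": [(0, 0.9), (1, 0.1)],
    "doc_b": [(0, 0.2), (1, 0.8)],
    "doc_c": [(0, 0.6), (1, 0.4)],
}

def invert_topic_mapping(doc_topics):
    """Build a topic_id -> [(doc_id, score), ...] mapping, best score first."""
    topic_docs = defaultdict(list)
    # steps 1 & 2: visit every doc and file it under each of its topics
    for doc_id, topics in doc_topics.items():
        for topic_id, score in topics:
            topic_docs[topic_id].append((doc_id, score))
    # step 3: sort each topic's docs by descending score
    for docs in topic_docs.values():
        docs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(topic_docs)

topic_docs = invert_topic_mapping(doc_topics)
print(topic_docs[0])
```

A dict keyed by topic id works even when topic ids are not consecutive integers; a plain list indexed by topic id is enough when they are.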

If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.

Update, now that you've added your code:

OK, if your model is in ldamodel and your BOW-formatted docs in corpus, you'd do something like:

    # setup: create an empty list per topic to collect the docs
    # (num_topics covers every topic in the model, whereas
    # print_topics() only returns up to 20 topics by default)
    docs_per_topic = [[] for _ in range(ldamodel.num_topics)]

    # now, for every doc...
    for doc_id, doc_bow in enumerate(corpus):
        # ...get its topics (very low-scoring topics are omitted by default)...
        doc_topics = ldamodel.get_document_topics(doc_bow)
        # ...& for each of its topics...
        for topic_id, score in doc_topics:
            # ...add the doc_id & its score to the topic's doc list
            docs_per_topic[topic_id].append((doc_id, score))

After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):

    print(docs_per_topic[0])

If you're interested in the top docs per topic, you can further sort each list's pairs by their score:

    for doc_list in docs_per_topic:
        doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)

Then, you could get the top-10 docs for topic 0 like:

    print(docs_per_topic[0][:10])
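
Since the question asks for documents that are primarily about one topic, another option is to file each document under only its highest-scoring topic. A minimal sketch, assuming per-document results in the list-of-(topic_id, score) format that get_document_topics returns; the sample scores and the helper name docs_by_dominant_topic are made up for illustration:

```python
def docs_by_dominant_topic(per_doc_topics, num_topics):
    """Return one list of doc ids per topic, each doc filed under its top topic."""
    dominant = [[] for _ in range(num_topics)]
    for doc_id, topics in enumerate(per_doc_topics):
        if not topics:
            continue  # no topic was reported for this doc
        top_topic, _ = max(topics, key=lambda pair: pair[1])
        dominant[top_topic].append(doc_id)
    return dominant

# Made-up scores standing in for:
#   [ldamodel.get_document_topics(bow) for bow in corpus]
sample = [
    [(0, 0.85), (1, 0.15)],
    [(0, 0.30), (1, 0.70)],
    [(0, 0.55), (1, 0.45)],
]
dominant = docs_by_dominant_topic(sample, num_topics=2)
print(dominant[0])  # doc ids whose strongest topic is topic 0
```

Unlike the full per-topic lists, each document appears exactly once here, so the result reads as a partition of the corpus by main topic.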

Note that this does everything using all-in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
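
As one possible disk-backed variant, the (topic, doc, score) triples could go into a SQLite table via Python's standard sqlite3 module. This is only a sketch: the table, column and function names are invented here, the rows are sample data, and ":memory:" would be replaced by a file path to actually persist to disk:

```python
import sqlite3

# Sample (topic_id, doc_id, score) rows; in practice these would be produced
# one document at a time while looping over the corpus.
rows = [(0, 0, 0.85), (1, 0, 0.15), (0, 1, 0.30), (1, 1, 0.70), (0, 2, 0.55)]

conn = sqlite3.connect(":memory:")  # use a filename to store on disk
conn.execute(
    "CREATE TABLE topic_docs (topic_id INTEGER, doc_id INTEGER, score REAL)"
)
# index so that per-topic lookups don't scan the whole table
conn.execute("CREATE INDEX idx_topic_score ON topic_docs (topic_id, score)")
conn.executemany("INSERT INTO topic_docs VALUES (?, ?, ?)", rows)
conn.commit()

def top_docs(conn, topic_id, limit=10):
    """Return up to `limit` (doc_id, score) pairs for a topic, best first."""
    cursor = conn.execute(
        "SELECT doc_id, score FROM topic_docs "
        "WHERE topic_id = ? ORDER BY score DESC LIMIT ?",
        (topic_id, limit),
    )
    return cursor.fetchall()

print(top_docs(conn, 0, limit=2))
```

With this layout the sorting happens in the query, so only the rows you ask for are ever loaded into memory.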