
I use the Gensim package for topic modelling. The idea is to understand the topics present in Flickr tags. So far I am using this code (the documents are tag strings):

    from gensim import corpora
    from gensim.models import ldamodel

    # Tokenize each document on ";" and drop stopwords
    texts = [[word for word in document.split(";") if word not in stoplist]
             for document in documents]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, alpha=0.1, num_topics=10)

    topic = []
    for topic_number, keywords in lda.print_topics(num_topics=4, num_words=10):
        keywords_update = {}
        # Each entry looks like '0.025*"word"'; split on "*" rather than
        # slicing at fixed string positions
        for entry in keywords.split(" + "):
            probab, keyword = entry.split("*")
            probab = float(probab)
            keyword = keyword.strip('"')
            if probab > 0.02:
                keywords_update[keyword] = probab
        topic.append(keywords_update)
    print(topic)

So basically I train the LDA on all my documents and then print the 10 most probable words for every topic. Is this correct? Or do I have to train the model on some part of the documents and then use corpus_lda = lda[corpus] to apply the trained model to the unseen documents? If the results are different every time I run the model, does that mean the number of topics is not correct? What is the best way to evaluate the results?

To see which topics each document is most related to, you need to use lda[corpus]. To see which words each topic is most related to, you can print the 10 most probable words for every topic. See here for other functions that can help you print these things. – interpolack

1 Answer


The topic distribution of an unseen document can change each time you query the model because Gensim infers it with an approximate Bayesian method (variational inference) rather than computing it exactly. As for getting the top-n words from each topic, Gensim does most of the work for you.

The snippet below returns a dict of topic_id: {word: probability} with the top 10 words of each topic in the model.

    topn_words = {i: {word: prob for word, prob in lda.show_topic(i, topn=10)}
                  for i in range(lda.num_topics)}

When you use lda[unseen_document], it returns the document's topic distribution as a list of (topic_id, probability) tuples, one entry per topic the document is associated with; topics whose probability falls below the model's minimum_probability threshold are omitted.
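As a minimal sketch, assuming unseen_text is a hypothetical ;-separated tag string in the same format as the question's documents:

    # Convert the unseen document to a bag of words using the trained dictionary
    unseen_bow = dictionary.doc2bow(unseen_text.split(";"))
    print(lda[unseen_bow])
    # e.g. [(2, 0.71), (7, 0.24)] -- (topic_id, probability) pairs;
    # topics below minimum_probability are left out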

Once you have topic distributions for a collection of unseen documents, you can compute similarities between them. Gensim has cosine similarity built in.

    # tokenize() is a placeholder for your own preprocessing (see below)
    bow_1 = dictionary.doc2bow(tokenize(text_1))
    vec_1 = lda[bow_1]
    bow_2 = dictionary.doc2bow(tokenize(text_2))
    vec_2 = lda[bow_2]
    # Cosine similarity between the two topic distributions
    gensim.matutils.cossim(vec_1, vec_2)

In this example, tokenize is a made-up function: you could either use Gensim's built-in simple_preprocess() method or prune and tokenize the text some other way. The dictionary.doc2bow() method takes a list of words and outputs a bag of words, i.e. a list of (word_id, frequency) tuples.
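For illustration, with some hypothetical tags (the actual ids depend on your dictionary):

    dictionary.doc2bow(["sunset", "beach", "sunset"])
    # -> [(0, 2), (1, 1)]  -- (word_id, frequency) pairs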