I use the Gensim package for topic modelling. The idea is to understand the topics present in Flickr tags. So far I have been using this code (the documents are tags):
from gensim import corpora
from gensim.models import ldamodel

# each document is a semicolon-separated string of tags
texts = [[word for word in document.split(";") if word not in stoplist]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = ldamodel.LdaModel(corpus, id2word=dictionary, alpha=0.1, num_topics=10)

topic = []
# show_topics(formatted=False) yields (topic_id, [(word, probability), ...])
# pairs, so the formatted string never has to be parsed by hand
for topic_number, keywords in lda.show_topics(num_topics=4, num_words=10,
                                              formatted=False):
    # keep only the words whose probability exceeds the threshold
    keywords_update = {word: prob for word, prob in keywords if prob > 0.02}
    topic.append(keywords_update)
print(topic)
So basically I train the LDA model on all my documents and then print the 10 most probable words for every topic. Is that correct? Or do I have to train the model on part of the documents and then use corpus_lda = lda[corpus] to apply the trained model to the unseen documents? If the results are different every time I run the model, does it mean that the number of topics is not correct? What is the best way to evaluate the results?
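For illustration, here is a minimal sketch of a train/apply split and of one common way to score the model; the 80/20 split, the fixed random_state, and the c_v coherence measure are arbitrary choices for the example, not the only options:

from gensim import corpora
from gensim.models import ldamodel
from gensim.models.coherencemodel import CoherenceModel

# illustrative 80/20 split: train on one part, apply to the rest
split = int(0.8 * len(texts))
train_texts, unseen_texts = texts[:split], texts[split:]

dictionary = corpora.Dictionary(train_texts)
train_corpus = [dictionary.doc2bow(text) for text in train_texts]

# a fixed random_state makes repeated runs reproducible, which separates
# "wrong number of topics" from ordinary run-to-run randomness
lda = ldamodel.LdaModel(train_corpus, id2word=dictionary, alpha=0.1,
                        num_topics=10, random_state=42)

# applying the trained model to an unseen document: convert it with the
# *training* dictionary (unknown words are silently dropped) and index
# the model with the bag-of-words vector
for text in unseen_texts:
    bow = dictionary.doc2bow(text)
    print(lda[bow])  # [(topic_id, probability), ...]

# topic coherence is one common score for comparing numbers of topics
coherence = CoherenceModel(model=lda, texts=train_texts,
                           dictionary=dictionary, coherence='c_v')
print(coherence.get_coherence())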
Once the model is trained you can apply it to unseen documents with lda[corpus]. To see which words each topic is most related to, you can print the 10 most probable words for every topic. See here for other functions that can help you print these things. – interpolack
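As a sketch of the kind of helpers the comment refers to, gensim's LdaModel also exposes show_topic and get_topic_terms for inspecting individual topics:

# show_topic returns the top (word, probability) pairs for one topic
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=10))

# get_topic_terms returns (word_id, probability) pairs; map the ids
# back to words through the dictionary
print([(dictionary[word_id], prob)
       for word_id, prob in lda.get_topic_terms(0, topn=10)])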