
I am using LDA for a topic modelling task. As suggested in various online forums, I trained my model on a fairly large corpus: the NYTimes news dataset (a ~200 MB CSV file), which contains reports on a wide variety of news topics. Surprisingly, the topics it produces are mostly related to US politics, and when I test it on a new document about educating children and parenting, it predicts the most likely topic as this:

['two', 'may', 'make', 'company', 'house', 'things', 'case', 'use']

Here is my model:

import os

import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

def ldamodel_english(filepath, data):
    # 'data' should be an iterable of raw documents (one string per document),
    # tokenized per document rather than as one flat token list
    data_words = [simple_preprocess(str(doc), deacc=True) for doc in data]

    # Build the bigram model and remove stopwords
    bigram = Phrases(data_words, min_count=5, threshold=100)
    bigram_mod = Phraser(bigram)
    stop_words_english = stopwords.words('english')
    data_nostops = [[word for word in doc if word not in stop_words_english]
                    for doc in data_words]
    data_bigrams = [bigram_mod[doc] for doc in data_nostops]
    data_bigrams = [doc for doc in data_bigrams if doc]

    # Map word ids to words and build the bag-of-words corpus
    id2word = corpora.Dictionary(data_bigrams)
    corpus = [id2word.doc2bow(text) for text in data_bigrams]

    # Build the LDA model. 'alpha' and 'eta' are the Dirichlet priors controlling
    # how many topics a document mixes and how many words a topic concentrates on
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=20,
                                                random_state=10, iterations=100, update_every=1,
                                                chunksize=1000, passes=8, alpha=0.09, eta=0.8,
                                                per_word_topics=True)
    print('\nLog perplexity: ' + str(lda_model.log_perplexity(corpus)) + '\n')
    for i, topic in lda_model.show_topics(formatted=True, num_topics=20, num_words=10):
        print('TOPIC #' + str(i) + ': ' + topic + '\n')
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_bigrams,
                                         dictionary=id2word, coherence='c_v')
    print('\nCoherence Score: ', coherence_model_lda.get_coherence())
    saved_model_path = os.path.join(filepath, 'ldamodel_english')
    lda_model.save(saved_model_path)

    return saved_model_path, corpus, id2word

The 'data' argument comes from the 'Content' column of the NYTimes news dataset, and I used the Gensim library for LDA.

My question is: if a well-trained LDA model predicts this badly, why is there so much hype around it, and what would be an effective alternative?


1 Answer


This can be a perfectly valid output of the model. Given that the source texts are not necessarily related to "children's education and parenting", the topic found to be most similar may share only a superficial resemblance with your article. There is probably little vocabulary in common between NYTimes articles and your document, so the words that make a topic distinctive within the NYTimes corpus may have very little to do with your article. In fact, as in your case, the only shared words may be generic ones that are typical of almost any text.

This happens frequently when the corpus used to train the LDA model has little to do with the documents it is later applied to, so there is really no surprise here. The size of the corpus does not help; what matters is the vocabulary and topical overlap.

I suggest that you either tune the number of topics or, better, train LDA on a suitable corpus, one that contains texts related to the documents you intend to classify.