1 vote

I was trying this for a Latent Dirichlet allocation implementation but am getting repeated terms. How can I get unique terms from LDA?

library(tm)
myCorpus <- Corpus(VectorSource(tweets$text))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)  # strip whole URLs
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
library(SnowballC)
myCorpus <- tm_map(myCorpus, stemDocument)
dtm <- DocumentTermMatrix(myCorpus)
library("RTextTools", lib.loc="~/R/win-library/3.2")
library("topicmodels", lib.loc="~/R/win-library/3.2")
om1 <- LDA(dtm, 30)
terms(om1)

This is the output

Welcome to SO. What's tweets$text? Please provide a minimal reproducible example. – lukeA
I have used that code before; text.csv contains the text of 500 tweets: tweets = read.csv("text.csv") – Aman Gupta

2 Answers

3 votes

According to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation, in LDA each document is viewed as a mixture of various topics. That is, for each document (tweet) we get the probability of the tweet belonging to each topic, and these probabilities sum to 1.

Similarly, each topic is viewed as a mixture of various terms (words). That is, for each topic we get the probability of each word belonging to that topic, and these probabilities also sum to 1. Hence every word-topic combination has a probability assigned to it. The code terms(om1) gets the word with the highest probability for each topic.

So in your case you are finding the same word having the highest probability in multiple topics. This is not an error.
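If you want to see more than the single most probable word, terms() also accepts a count of terms to return. For example, with the same om1 model, the following lists the five most probable words for each topic, which makes repeated top words across topics easier to spot:

# Top 5 words per topic instead of only the single most probable one
terms(om1, 5)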

The code below creates a TopicTermdf data frame, which holds the distribution of all the words for each topic. Looking at this dataset will help you understand it better.

The code below is based on the following post: LDA with topicmodels, how can I see which topics different documents belong to?

Code:

# Reproducible data - from the Coursera.org Johns Hopkins Data Science Specialization Capstone project (SwiftKey Challenge dataset)

tweets <- c("How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.",
           "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.",
           "they've decided its more fun if I don't.",
           "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)",
           "Words from a complete stranger! Made my birthday even better :)",
           "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!",
           "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing",
           "I'm coo... Jus at work hella tired r u ever in cali",
           "The new sundrop commercial ...hehe love at first sight",
           "we need to reconnect THIS WEEK")


library(tm)
myCorpus <- Corpus(VectorSource(tweets))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)  # strip whole URLs
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumPunct))
myStopwords <- c(stopwords('english'), "available", "via")
myStopwords <- setdiff(myStopwords, c("r", "big"))
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpusCopy <- myCorpus
library(SnowballC)
myCorpus <- tm_map(myCorpus, stemDocument)
dtm <- DocumentTermMatrix(myCorpus)

library(RTextTools)
library(topicmodels)
om1 <- LDA(dtm, 3)
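# The output below uses a TopicTermdf data frame that the code above does not
# build; a plausible way to construct it (an assumption, not shown in the
# original answer) is from the fitted model's posterior topic-term probabilities:
TopicTermdf <- as.data.frame(posterior(om1)$terms)  # rows = topics, columns = terms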

Output:

> # Get the top word for each topic 
> terms(om1) 
Topic 1 Topic 2 Topic 3 
"youll"   "cub" "anoth" 
> 
> #Top word for each topic
> colnames(TopicTermdf)[apply(TopicTermdf,1,which.max)]
[1] "youll" "cub"   "anoth"

1 vote

Try to find the optimal number of topics. To do this, build multiple LDA models with different numbers of topics and pick the one with the highest coherence score. If you are seeing the same keywords (terms) repeated in multiple topics, it is probably a sign that the value of k (the number of topics) is too large. Although it is written in Python, here is the link to an LDA topic modeling tutorial where you will find a grid-search method for choosing the optimal value (i.e. deciding how many topics to use).
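In R with the topicmodels package used above, a readily available stand-in for a coherence score is held-out perplexity: fit a model for each candidate value of k and keep the one with the lowest perplexity. The sketch below assumes the dtm object built in the other answer; the candidate values of k, the seed, and the 80/20 split are illustrative only, and perplexity is used here in place of the coherence score mentioned above.

library(topicmodels)

# Assumes `dtm` is the DocumentTermMatrix built earlier and that every
# document still contains at least one term after preprocessing.
set.seed(123)
train_idx <- sample(seq_len(nrow(dtm)), size = floor(0.8 * nrow(dtm)))
dtm_train <- dtm[train_idx, ]
dtm_test  <- dtm[-train_idx, ]

# Fit an LDA model for each candidate number of topics and score it by
# perplexity on the held-out documents (lower is better).
ks <- 2:6
perp <- sapply(ks, function(k) {
  fit <- LDA(dtm_train, k = k, control = list(seed = 123))
  perplexity(fit, newdata = dtm_test)
})

# Candidate k with the lowest held-out perplexity
ks[which.min(perp)]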