
I am doing sentiment analysis. My corpus directory contains two documents: one of positive tweets and one of negative tweets. However, the comparison word cloud still shows stopwords, which means that stopwords("english") is not removing them.
I created custom stopwords, but that output was not retained either. I then searched and found a stopwords.txt file on GitHub, downloaded it, and used it to remove the stopwords. To use this file I had to read it in as a table (a data frame) and then convert its column to a vector. I combined it with the stopwords from the tm library.
The output was as expected, but when I then removed the punctuation and inspected the corpus, the output reflected only removePunctuation; the stopword removal was not retained.
Then I tried removeNumbers and inspected the corpus. Again the stopword removal was not retained, although the output of removePunctuation was. So what is the problem here?

What am I missing here?
[This is the code][1]
[This is the output after removing the stopwords from the tweets using R][2]
[This is the output after applying other cleaning like removePunctuation, removeNumbers, stripWhitespace, stemDocument, but it is not retaining the removed stopwords output][3]
[1]: https://i.stack.imgur.com/RMbvD.png
[2]: https://i.stack.imgur.com/18H3P.png
[3]: https://i.stack.imgur.com/SxaJE.png

This is the code that I have used. I put the two text files in the directory and converted them into a corpus.

library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, 
stopwords("english"))
inspect(clean_tweets_corpus)
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
class(stop)
stop
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
class(stop_vec)
stop_vec
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, 
c(stopwords("english"), stop_vec))
inspect(clean_tweets_corpus)
## remove to have single characters ##
remove_multiplechar<-function(x) gsub("\\b[A-z]\\b{1}"," ",x)
clean_tweets_corpus<-tm_map(tweets_corpus, 
content_transformer(remove_multiplechar))
inspect(clean_tweets_corpus)
clean_tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(tweets_corpus,removeNumbers)
clean_tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(tweets_corpus, stemDocument)
inspect(clean_tweets_corpus)
str(clean_tweets_corpus)
Your function calls are incorrect. In clean_tweets_corpus <- tm_map(tweets_corpus, ...) you are calling the tm_map function with tweets_corpus but saving the result to clean_tweets_corpus. Then, in the next call to tm_map, you are still using the original, unmodified tweets_corpus and overwriting the updated clean_tweets_corpus. - Dave2e
Thanks for your reply, @Dave2e. What can I do? Kindly correct the code, please. I am new to R, so I can't understand much of this. - Mahnoor
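
The point in the comment above can be illustrated with a minimal sketch in base R (no tm needed). The string and the variable names bad and good are made up for illustration; gsub stands in for the tm_map cleaning steps.

```r
x <- "The cat, the dog and 3 birds!"

# Buggy pattern: every step starts again from the original x,
# so only the effect of the last step survives.
bad <- gsub("[[:punct:]]", "", x)
bad <- gsub("[0-9]", "", x)

# Chained pattern: each step works on the result of the previous one,
# so the effects accumulate.
good <- gsub("[[:punct:]]", "", x)
good <- gsub("[0-9]", "", good)

print(bad)   # punctuation is still there
print(good)  # punctuation and digits are both gone
```

The same applies to tm_map: passing tweets_corpus to every call is the "bad" pattern, and passing clean_tweets_corpus is the "good" one.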

1 Answer


Here is the corrected code. Every call to tm_map except the first one now takes "clean_tweets_corpus" instead of "tweets_corpus", so each cleaning step builds on the result of the previous one:

library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)

##cleaning the tweets_corpus ##
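## wrap base functions such as tolower in content_transformer so the corpus structure is kept ##
clean_tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))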

##removing stopwords##
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)

clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeWords, 
                              c(stopwords("english"), stop_vec))

## remove single-character words ##
remove_multiplechar<-function(x) gsub("\\b[A-Za-z]\\b"," ",x)
clean_tweets_corpus<-tm_map(clean_tweets_corpus, 
                            content_transformer(remove_multiplechar))

clean_tweets_corpus <- tm_map(clean_tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeNumbers)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stemDocument)
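
To check that the chained calls really retain every step, here is a small sketch that can be run without the tweets directory. It assumes only that the tm package is installed. The two sentences, the names docs, corpus, clean and cleaned_text, and the use of VCorpus(VectorSource(...)) in place of DirSource are made up for illustration. stemDocument and the custom stopwords.txt list are left out to keep the sketch short.

```r
library(tm)

# Two made-up documents standing in for the positive and negative tweet files.
docs <- c("I LOVE this phone, it is the best of 2019!",
          "This is the worst service... I waited 45 minutes.")
corpus <- VCorpus(VectorSource(docs))

# Each step takes the previous result (clean) as its input,
# so the effect of every transformation is kept.
clean <- tm_map(corpus, content_transformer(tolower))
clean <- tm_map(clean, removeWords, stopwords("english"))
clean <- tm_map(clean, removePunctuation)
clean <- tm_map(clean, removeNumbers)
clean <- tm_map(clean, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned_text <- sapply(clean, as.character)
print(cleaned_text)
```

In the printed text the stopwords, punctuation and numbers should all be gone at once, which is the behaviour the question was after.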