I'm trying to build a DocumentTermMatrix in R, using the control = list() parameter to limit the terms to a pre-defined list of text-based emojis (:D, :), :(, etc.). However, the resulting dtm doesn't pick up certain emojis (like ":D" or ":)"), while others work fine (":))"). My code:

library(tm)
text = c(":D", ":))")
corpus <- Corpus(VectorSource(text))
corpus = tm_map(corpus, PlainTextDocument)
dtm = DocumentTermMatrix(corpus, control = list(dictionary = c(":D", ":))")))
emojidf <- as.data.frame(as.matrix(dtm))

  :D :))
1  0   0
2  0   1

To work around this, I could use content_transformer and gsub to replace the problematic emojis with words. However, I'd like to understand how DocumentTermMatrix (or even Corpus) handles terms that consist of punctuation.

1 Answer

Two issues (see ?DocumentTermMatrix and ?termFreq): By default, wordLengths is c(3, Inf), so any term shorter than 3 characters (such as ":D" or ":)") is dropped. And tolower is applied by default, which turns :D into :d, so it no longer matches the dictionary entry ":D". So try:

library(tm)
text <- c(":D", ":))" ) 
corpus <- Corpus(VectorSource(text))
dtm <- DocumentTermMatrix(
  corpus, 
  control = list(
    dictionary = c(":D" , ":))"), 
    wordLengths=c(-Inf,Inf), 
    tolower=FALSE
  )
)
as.matrix(dtm)
#     Terms
# Docs :)) :D
#    1   0  1
#    2   1  0
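
To see why each emoji is kept or dropped under the defaults, the two checks can be imitated in plain base R. This is only a rough sketch: it does not call tm's internals, and the variable names (emojis, long_enough, lowered, in_dictionary) are made up for illustration. It assumes the dictionary itself is compared as given, without being lowercased.

```r
# Base-R sketch only: imitates the two default filters, not tm's actual code.
emojis <- c(":D", ":)", ":))")

# Default wordLengths = c(3, Inf): terms shorter than 3 characters are dropped.
long_enough <- nchar(emojis) >= 3

# Default tolower = TRUE: terms are lowercased before the dictionary lookup,
# so ":D" becomes ":d" (assumption: the dictionary entries stay unchanged).
lowered <- tolower(emojis)
in_dictionary <- lowered %in% emojis

# One row per emoji, showing which check it passes.
data.frame(term = emojis, long_enough, lowered, in_dictionary)
```

Only ":))" passes both checks, which matches the original output. ":)" fails on length alone, and ":D" fails on both length and case, which is why the fix needs both wordLengths and tolower.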