I'm trying to build a DocumentTermMatrix in R, using the control = list() argument to limit the terms to a pre-defined list of text-based emojis (:D, :), :(, etc.). However, the resulting dtm doesn't pick up certain emojis (like ":D" or ":)"), while others work fine (":))"). My code:
library(tm)

text <- c(":D", ":))")
corpus <- Corpus(VectorSource(text))
corpus <- tm_map(corpus, PlainTextDocument)
# Restrict the matrix to a fixed dictionary of emojis
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = c(":D", ":))")))
emojidf <- as.data.frame(as.matrix(dtm))
Output:
  :D :))
1  0   0
2  0   1
To work around this, I could use content_transformer and gsub to replace the problematic emojis with words. However, I'd like to understand how DocumentTermMatrix (or even Corpus) handles tokens that consist of punctuation.
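A rough sketch of that gsub workaround, to make the idea concrete. The helper name replace_emojis and the placeholder words in emoji_words are hypothetical, chosen only for illustration. The placeholders are lowercase letters only, on the assumption that plain words survive tokenization unchanged.

```r
# Hypothetical mapping from emoji to placeholder word (illustrative names only)
emoji_words <- c(
  ":))" = "emojibigsmile",
  ":D"  = "emojigrin",
  ":)"  = "emojismile",
  ":("  = "emojifrown"
)

# Replace each emoji with its placeholder word, padded with spaces so it
# becomes a separate token. Longer emojis are handled first so that ":))"
# is not consumed as ":)" followed by a stray ")".
replace_emojis <- function(x, mapping = emoji_words) {
  emojis <- names(mapping)
  for (emo in emojis[order(-nchar(emojis))]) {
    # fixed = TRUE treats the emoji as a literal string, not a regex
    x <- gsub(emo, paste0(" ", mapping[[emo]], " "), x, fixed = TRUE)
  }
  x
}

replace_emojis(c(":D", ":))"))
```

With tm loaded, the helper could then be applied to the corpus via tm_map(corpus, content_transformer(replace_emojis)), and the dictionary passed to DocumentTermMatrix would list the placeholder words instead of the raw emojis.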