When I build a document-term matrix in R with the tm package, some of the terms come out truncated.
I create the document-term matrix from a corpus as follows:

library(tm)

docs <- c("All that we are is the result of what we have thought.",
          "Wisely, and slow. They stumble that run fast.",
          "The future belongs to those who prepare for it today.",
          "Our life is frittered away by detail... simplify, simplify.",
          "Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.")

myCorpus <- Corpus(VectorSource(docs))

ndocs <- length(myCorpus)
minTermFreq <- 0.05 * ndocs
maxTermFreq <- 0.6 * ndocs

myDTM <- DocumentTermMatrix(myCorpus,
                            control = list(stopwords = TRUE,
                                           wordLengths=c(3, Inf),
                                           removePunctuation = TRUE,
                                           removeNumbers = TRUE,
                                           tolower=TRUE,
                                           stemming = TRUE,
                                           remove_separators = TRUE,
                                           bounds = list(global = c(minTermFreq, maxTermFreq))
                                           )
                            )

When I look at the terms, longer ones are truncated, but not consistently:

myDTM[["dimnames"]][["Terms"]]

#  [1] "absolut"   "away"      "beauti"    "belong"    "better"   
#  [6] "bore"      "detail"    "fast"      "fritter"   "futur"    
# [11] "genius"    "imperfect" "it’"       "life"      "mad"      
# [16] "prepar"    "result"    "ridicul"   "run"       "simplifi" 
# [21] "slow"      "stumbl"    "thought"   "today"     "wise" 

"Absolutely" is truncated to 7 characters, while "beauty" is truncated to 6. What's the fix for this? Or am I missing something obvious?

1 Answer

You have stemmed the words by using the option stemming = TRUE.
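
To see that the shortened terms come from the stemmer rather than from any length limit, you can run the stemmer on the words directly. This is a minimal sketch that assumes the SnowballC package is installed (as far as I know, tm relies on it when stemming is switched on) and that the English stemmer is the one in use:

```r
library(SnowballC)

# Two of the words from the question
words <- c("absolutely", "beauty")

# Stem them with the English Snowball stemmer
stems <- wordStem(words, language = "english")

print(stems)
# Should match the terms in the question: "absolut" "beauti"
```

The stems differ in length because the stemmer strips suffixes by rule; it does not cut words at a fixed number of characters.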

You can either set this to FALSE to avoid stemming, in which case words such as stumble, stumbles and stumbled will all be counted separately, or complete the stems using stemCompletion. By default this replaces each stem with the most frequent matching word from a dictionary you supply, such as the original text (you can change this behaviour with the type parameter).
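
A rough sketch of both options follows. The small corpus and the `dict` vector are made up for illustration and are not the data from the question; in practice the dictionary would be the words of your original, unstemmed text.

```r
library(tm)

# Hypothetical one-document corpus with three inflections of one word
toy <- Corpus(VectorSource("stumble stumbles stumbled"))

# Option 1: no stemming, so each inflection is kept as its own term
dtm_unstemmed <- DocumentTermMatrix(toy, control = list(stemming = FALSE))
print(Terms(dtm_unstemmed))

# Option 2: keep stemming, then map the stems back to full words.
# dict is a hypothetical vector of unstemmed words to complete against.
dict <- c("stumble", "stumble", "stumbles", "stumbled",
          "prepare", "prepared")
stems <- c("stumbl", "prepar")

# The default type = "prevalent" picks the most frequent matching word
completed <- stemCompletion(stems, dictionary = dict)
print(completed)
```

As far as I understand it, stemCompletion looks for dictionary words that begin with the stem, so a stem such as "beauti" or "simplifi" may not be completed back to "beauty" or "simplify".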