When I build a document-term matrix with the tm package in R, some of the terms appear to be truncated.
I create the document-term matrix from a corpus like this:
library(tm)
docs <- c("All that we are is the result of what we have thought.",
          "Wisely, and slow. They stumble that run fast.",
          "The future belongs to those who prepare for it today.",
          "Our life is frittered away by detail... simplify, simplify.",
          "Imperfection is beauty, madness is genius and it’s better to be absolutely ridiculous than absolutely boring.")
myCorpus <- Corpus(VectorSource(docs))
ndocs <- length(myCorpus)
minTermFreq <- 0.05 * ndocs
maxTermFreq <- 0.6 * ndocs
myDTM <- DocumentTermMatrix(myCorpus,
                            control = list(stopwords = TRUE,
                                           wordLengths = c(3, Inf),
                                           removePunctuation = TRUE,
                                           removeNumbers = TRUE,
                                           tolower = TRUE,
                                           stemming = TRUE,
                                           remove_separators = TRUE,
                                           bounds = list(global = c(minTermFreq, maxTermFreq))
                            )
)
When I look at the terms, longer ones are truncated, but not consistently:
myDTM[["dimnames"]][["Terms"]]
# [1] "absolut" "away" "beauti" "belong" "better"
# [6] "bore" "detail" "fast" "fritter" "futur"
# [11] "genius" "imperfect" "it’" "life" "mad"
# [16] "prepar" "result" "ridicul" "run" "simplifi"
# [21] "slow" "stumbl" "thought" "today" "wise"
"absolutely" (10 characters) is cut down to the 7-character "absolut", while "beauty" keeps its 6 characters but turns into "beauti". What's the fix for this? Or am I missing something obvious?
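One way to narrow this down is to check whether the shortened terms are stems rather than truncations. The sketch below assumes they come from the `stemming = TRUE` option and that the SnowballC package (which tm's stemmer relies on) is installed. The names `words`, `stems`, `corp` and `dtm_plain` are made up for the illustration.

```r
library(tm)  # stemDocument() needs the SnowballC package to be installed

# A few of the original words whose terms looked "truncated"
words <- c("absolutely", "beauty", "simplify", "madness", "boring")

# Stem each word directly; this is assumed to be the same step
# that stemming = TRUE applies while the matrix is built
stems <- stemDocument(words)
print(data.frame(word = words, stem = stems))

# Rebuild a small matrix with stemming switched off for comparison
corp <- Corpus(VectorSource("Imperfection is beauty, madness is genius."))
dtm_plain <- DocumentTermMatrix(corp,
                                control = list(removePunctuation = TRUE,
                                               tolower = TRUE,
                                               stemming = FALSE))
print(Terms(dtm_plain))
```

If the stems printed here match the terms in `myDTM`, the differing lengths would be explained by the stemmer's suffix rules rather than by a character limit.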