tm is throwing an error when I try to create a document term matrix

library(tm)
data(crude)

#control parameters
dtm.control <- list(
    tolower           = TRUE, 
    removePunctuation = TRUE,
    removeNumbers     = TRUE,
    stopWords         = stopwords("english"),
    stemming          = TRUE, # false for sentiment
    wordLengths       = c(3, "inf"))

dtm <- DocumentTermMatrix(corp, control = dtm.control)

Error:

Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  'i, j, v' different lengths
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
  NAs introduced by coercion

What am I doing wrong?

Also, I am using these tutorials:

Are there better or more recent walkthroughs?

1 Answer

Two things in your code are worth a second look: the stop-word option, and the corpus itself, since corp is never created in the snippet you posted. The following worked for me:

library(tm)
data("crude")

#control parameters
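dtm.control <- list(
  tolower           = TRUE, 
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  removestopWords   = TRUE, # note: ?termFreq documents this option as 'stopwords'
  stemming          = TRUE, # false for sentiment
  wordLengths       = c(3, "inf")) # note: the documented default is numeric, c(3, Inf)

# crude is already a corpus; this wraps it again so that corp exists
corp <- Corpus(VectorSource(crude))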

dtm <- DocumentTermMatrix(corp, control = dtm.control)

> inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 848)>>
Non-/sparse entries: 1877/15083
Sparsity           : 89%
Maximal term length: 16
Weighting          : term frequency (tf)
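
For comparison, here is a minimal sketch that sticks to the option names documented in ?termFreq: stopwords instead of removestopWords, and a numeric c(3, Inf) for wordLengths, because c(3, "inf") is coerced to a character vector. It assumes the SnowballC package is installed (needed for stemming) and passes crude directly, since data("crude") already loads a corpus. The term counts may differ from the output above, because stop words should actually be removed here. The last line is only a debugging aid: the mclapply warning in the question says termFreq failed on the documents, so calling termFreq on a single document should surface the underlying message.

```r
library(tm)
data("crude")  # 20 Reuters articles, already a corpus

# Option names as documented in ?termFreq.
# Assumption: SnowballC is installed, which stemming = TRUE relies on.
dtm.control <- list(
  tolower           = TRUE,
  removePunctuation = TRUE,
  removeNumbers     = TRUE,
  stopwords         = TRUE,       # or stopwords("english") for an explicit list
  stemming          = TRUE,       # false for sentiment
  wordLengths       = c(3, Inf))  # numeric; c(3, "inf") would be character

# crude is passed directly, without rebuilding a corpus around it
dtm <- DocumentTermMatrix(crude, control = dtm.control)

# Debugging aid: run termFreq on one document, outside mclapply,
# so that any error is reported directly
tf <- termFreq(crude[[1]], control = dtm.control)
```

If the single-document call fails on your own corp, the message it prints is usually a better starting point than the simple_triplet_matrix error.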