I'm sure that many of you have seen this before:
Warning message:
In mclapply(content(x), FUN, ...) :
  all scheduled cores encountered errors in user code
This time, I get the error when I try to remove a custom list of stopwords from my corpus:
asdf <- tm_map(asdf, removeWords, mystops)
It works with a small stopword list (I tried up to 100 words or so), but my current stopword list has about 42,000 words.
I have tried this:
asdf <- tm_map(asdf, removeWords, mystops, lazy = TRUE)
This does not return an error, but every tm_map command after it produces the same warning as above, and when I try to compute a DTM from the corpus I get:
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code
I am thinking about writing a function that loops removeWords over small parts of my list, but I would also like to understand why the length of the list is a problem.
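A minimal sketch of that chunking idea, assuming tm is loaded as in the code further down; the helper names and the chunk size of 1000 are my own guesses, not tested limits:

```r
# Split a character vector into chunks of at most chunk.size elements.
chunked <- function(x, chunk.size) {
  split(x, ceiling(seq_along(x) / chunk.size))
}

# Apply removeWords once per chunk, so each internally built regular
# expression stays small. chunk.size = 1000 is an arbitrary guess.
removeWordsChunked <- function(corpus, words, chunk.size = 1000) {
  for (chunk in chunked(words, chunk.size)) {
    corpus <- tm_map(corpus, removeWords, chunk)
  }
  corpus
}

# usage: asdf <- removeWordsChunked(asdf, mystops)
```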
Here is my sessionInfo():
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] SnowballC_0.5.1 wordcloud_2.5 RColorBrewer_1.1-2 RTextTools_1.4.2 SparseM_1.74 topicmodels_0.2-4 tm_0.6-2
[8] NLP_0.1-9
loaded via a namespace (and not attached):
[1] Rcpp_0.12.7 splines_3.3.2 MASS_7.3-45 tau_0.0-18 prodlim_1.5.7 lattice_0.20-34 foreach_1.4.3
[8] tools_3.3.2 caTools_1.17.1 nnet_7.3-12 parallel_3.3.2 grid_3.3.2 ipred_0.9-5 glmnet_2.0-5
[15] e1071_1.6-7 iterators_1.0.8 modeltools_0.2-21 class_7.3-14 survival_2.39-5 randomForest_4.6-12 Matrix_1.2-7.1
[22] lava_1.4.5 bitops_1.0-6 codetools_0.2-15 maxent_1.3.3.1 rpart_4.1-10 slam_0.1-38 stats4_3.3.2
[29] tree_1.0-37
EDIT:
I use 20news-bydate.tar.gz and only the train folder.
I won't share all the preprocessing I am doing, as it includes a morphological analysis of the whole thing (not with R).
Here is my R code:
library(tm)
library(topicmodels)
library(SnowballC)
asdf <- Corpus(DirSource("/path/to/20news-bydate/train",encoding="UTF-8"),readerControl=list(language="en"))
asdf <- tm_map(asdf, content_transformer(tolower))
asdf <- tm_map(asdf, removeWords, stopwords(kind="english"))
asdf <- tm_map(asdf, removePunctuation)
asdf <- tm_map(asdf, removeNumbers)
asdf <- tm_map(asdf, stripWhitespace)
# until here: preprocessing
# building DocumentTermMatrix with term frequency
dtm <- DocumentTermMatrix(asdf, control=list(weighting=weightTf))
# building a matrix from the DTM and wordvector (all words as titles,
# sorted by frequency in corpus) and wordlengths (length of actual
# wordstrings in the wordvector)
m <- as.matrix(dtm)
wordvector <- sort(colSums(m), decreasing = TRUE)
wordlengths <- nchar(names(wordvector))
mystops1 <- names(wordvector)[wordlengths > 22] # all words longer than 22 characters
mystops2 <- names(wordvector)[wordvector < 3]   # all words with occurrence < 3
mystops <- c(mystops1, mystops2)                # the stopword list
# going back to the corpus to remove the chosen words
asdf <- tm_map(asdf, removeWords, mystops)
This is where I get the error.
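As an aside, if the only goal is to keep very long and very rare terms out of the DTM, the documented DocumentTermMatrix control options wordLengths and bounds can filter at construction time, sidestepping removeWords entirely. Note that bounds$global counts the number of documents containing a term, not its total occurrences, so this is close to but not identical to the filter above. A sketch using tm's bundled crude example corpus:

```r
library(tm)
data("crude")  # tm's small example corpus

# Keep only terms of 1-22 characters that appear in at least 3 documents.
dtm2 <- DocumentTermMatrix(crude, control = list(
  weighting   = weightTf,
  wordLengths = c(1, 22),
  bounds      = list(global = c(3, Inf))
))
```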
removeWords concatenates all words into a single regular expression (separated by the or-pipe |). I don't know where the character limit is, but I guess a few thousand words is clearly too much. In addition, please edit your post and make the example reproducible, as asked by the R tag (hover over it). tm has an example corpus, data("crude"), and you can easily create artificial stopwords using e.g. stringi::stri_rand_strings. – lukeA
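To illustrate lukeA's point about pattern size: if removeWords pastes the whole word list into one \b(w1|w2|...)\b pattern (my reading of tm's approach; the exact form may vary between versions), then 42,000 words yield a pattern hundreds of thousands of characters long:

```r
# Artificial stopword list of roughly the size in the question.
words <- sprintf("word%05d", seq_len(42000))

# A pattern shaped like the one removeWords builds internally
# (an assumption about tm's implementation, not copied from it).
pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
nchar(pattern)  # over 400,000 characters
```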