I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus, so that the loop removes each word in the list from the corpus sequentially.
Some replication data:
library(tm)
m <- cbind(c("Apple blue two", "Pear yellow five", "Banana yellow two"),
           c(1, 2, 3))
tm_corpus <- Corpus(VectorSource(m[,1]))
words <- as.list(c("Apple", "yellow", "two"))
tm_corpus is now a corpus object consisting of 3 documents:
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
words is a list of 3 words:
[[1]]
[1] "Apple"
[[2]]
[1] "yellow"
[[3]]
[1] "two"
I have tried three different loops. The first one is:
tm_corpusClean <- tm_corpus
for (i in seq_along(tm_corpusClean)) {
  for (u in seq_along(words)) {
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]])
  }
}
This returns the following error, together with seven warnings (numbered 1-7):
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
In addition: Warning messages:
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,
words[[u]]) :
number of items to replace is not a multiple of replacement length
[...]
The second one is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  for (u in seq_along(tm_corpusClean)) {
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]])
  }
}
This returns the error:
Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
The last loop is:
tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}
This does produce an object named tm_corpusClean, but inspecting it shows only the first document instead of all three originals:
inspect(tm_corpusClean[[1]])
<<PlainTextDocument>>
Metadata: 7
Content: chars: 6
blue
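For comparison, here is a minimal sketch of how the result of the last loop could be examined as a whole rather than one element at a time. It assumes that content() and inspect(), called on the entire corpus, report every document in a SimpleCorpus; the input is rebuilt from a plain character vector (docs, a name introduced here) so that the snippet runs on its own.

```r
library(tm)

# Rebuild the replication data (docs is a hypothetical helper name)
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# Same approach as the last loop: remove one word per pass over the corpus
tm_corpusClean <- tm_corpus
for (w in words) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, w)
}

# Look at the whole corpus instead of only tm_corpusClean[[1]]
length(tm_corpusClean)    # number of documents in the cleaned corpus
content(tm_corpusClean)   # text of every document as a character vector
inspect(tm_corpusClean)   # summary plus the text of all documents
```

Indexing with [[1]] selects a single document, so inspect(tm_corpusClean[[1]]) would be expected to print only that one document regardless of how many the corpus holds.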
Where am I going wrong?