I have a VCorpus, which is extracted like this:

corp <- VCorpus(DirSource("//Filepath"))

I then wanted to delete certain rows from the files in my corpus, namely those containing a certain word. To do this I converted the corpus to character vectors with as.character:

corp <- sapply(corp, as.character)

and then removed all rows including the word FILE:

for(j in seq(corp)) {
  corp[[j]] <- corp[[j]][!grepl("FILE", corp[[j]], ignore.case = FALSE)]
}

Now I want to go back to the class "VCorpus" to use tm_map to perform Corpus cleaning tasks like:

corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)

But I get the following error message:

Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "list"

I have tried several things but I get error messages like:

Error in UseMethod("as.VCorpus") : no applicable method for 'as.VCorpus' applied to an object of class "character"

Any ideas how I can transform back to VCorpus and perform tm_map tasks?

1 Answer

I don't think you should convert the corpus with as.character: sapply(corp, as.character) returns a plain list of character vectors and drops the class and metadata that make it a corpus, which is why tm_map no longer finds a method for it. The text of document i is already available as corp[[i]]$content, so you are better off working with that directly.

A workflow that works for me would be...

corp <- VCorpus(DirSource("//Filepath"))

for (j in seq_along(corp)) {
    corp[[j]]$content <- corp[[j]]$content[!grepl("FILE", corp[[j]]$content, ignore.case = FALSE)]
}

corp <- tm_map(corp, content_transformer(tolower))
...etc
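
If you would rather avoid the explicit loop, the same filtering can probably be expressed through tm_map itself. The following is only a minimal sketch under some assumptions: the two small files and the temporary directory are made-up stand-ins for your "//Filepath" data, and I match "FILE" as a literal string (fixed = TRUE) rather than as a regular expression. The corpus never leaves the VCorpus class, so the later cleaning steps should apply as usual.

```r
library(tm)

# Hypothetical demo data: a fresh temporary directory with two small files.
dir_path <- tempfile("corpus_demo")
dir.create(dir_path)
writeLines(c("FILE header", "Keep this line.", "Another line"),
           file.path(dir_path, "a.txt"))
writeLines(c("Some text", "FILE footer"),
           file.path(dir_path, "b.txt"))

# DirSource reads each file line by line, so every document's content
# is a character vector with one element per line.
corp <- VCorpus(DirSource(dir_path))

# content_transformer() wraps a function that works on the content only,
# so the document metadata and the VCorpus class are left untouched.
drop_file_lines <- content_transformer(function(x) {
  x[!grepl("FILE", x, fixed = TRUE)]
})
corp <- tm_map(corp, drop_file_lines)

# The usual cleaning steps still work because corp is still a VCorpus.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

class(corp)
content(corp[[1]])
```

The filter has to run before tolower, otherwise the upper-case "FILE" would no longer match.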