Clean corpus using Quanteda

Question

What's the Quanteda way of cleaning a corpus, as shown in the tm example below (lowercase, remove punctuation, remove numbers, stem words)? To be clear, I don't want to create a document-feature matrix with dfm(); I just want a clean corpus that I can use for a specific downstream task.

# This is what I want to do in quanteda
library("tm")
data("crude")
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)

PS: I am aware that I could just do quanteda_corpus <- quanteda::corpus(crude) to get what I want, but I would much prefer being able to do everything in Quanteda.

JBGruber · Accepted Answer · 2020-08-05T16:20:32

I think what you want to do is deliberately impossible in quanteda.

You can, of course, do the cleaning quite easily without losing the order of words using the tokens* set of functions:

library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>% 
  tokens_wordstem()

print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#>  [1] "Diamond"  "Shamrock" "Corp"     "said"     "that"     "effect"  
#>  [7] "today"    "it"       "had"      "cut"      "it"       "contract"
#> [ ... and 78 more ]
#> 
#> reut-00002.xml :
#>  [1] "OPEC"    "may"     "be"      "forc"    "to"      "meet"    "befor"  
#>  [8] "a"       "schedul" "June"    "session" "to"     
#> [ ... and 427 more ]
#> 
#> reut-00004.xml :
#>  [1] "Texaco"   "Canada"   "said"     "it"       "lower"    "the"     
#>  [7] "contract" "price"    "it"       "will"     "pay"      "for"     
#> [ ... and 40 more ]
#> 
#> [ reached max_ndoc ... 17 more documents ]
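
Note that the pipeline above does not lowercase the tokens (the output still shows "Diamond", "OPEC"), although the question asked for it. A minimal sketch of the full cleaning sequence, assuming a small made-up character vector in place of the crude data so that it runs without tm, could look like this. It avoids %>% because, as far as I know, newer quanteda versions no longer export the pipe:

```r
library("quanteda")

# Hypothetical toy texts standing in for the crude corpus
txt <- c(doc1 = "Diamond Shamrock Corp. said it cut contract prices today.",
         doc2 = "OPEC may be forced to meet before June 25.")

# Tokenise, dropping punctuation and numbers
toks_clean <- tokens(corpus(txt), remove_punct = TRUE, remove_numbers = TRUE)

# Lowercase first, then stem, mirroring the tm steps in the question
toks_clean <- tokens_tolower(toks_clean)
toks_clean <- tokens_wordstem(toks_clean)

print(toks_clean)
```

Lowercasing before stemming mirrors the order of the tm calls in the question; word order within each document is preserved throughout.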

But it is not possible to turn this tokens object back into a corpus with the functions quanteda provides. It would, however, be possible to write a new function to do this:

corpus.tokens <- function(x, ...) {
  quanteda:::build_corpus(
    unlist(lapply(x, paste, collapse = " ")),
    docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
  )
}

corp <- corpus(toks)
print(corp, max_ndoc = 3)
#> Corpus consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> "Diamond Shamrock Corp said that effect today it had cut it c..."
#> 
#> reut-00002.xml :
#> "OPEC may be forc to meet befor a schedul June session to rea..."
#> 
#> reut-00004.xml :
#> "Texaco Canada said it lower the contract price it will pay f..."
#> 
#> [ reached max_ndoc ... 17 more documents ]

But this object, while technically being a corpus class object, is not what a corpus is supposed to be. From ?corpus [emphasis added]:

Value

A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.

The object above does not meet this description, as the original texts have already been processed. Yet the class of the object communicates otherwise. I don't see a reason to break this logic, as all subsequent analysis steps should be possible using either tokens* or dfm* functions.
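
If the downstream task really needs cleaned text as strings rather than tokens, one option that avoids both the internal ::: functions and the misleading corpus class is to collapse the tokens into a plain named character vector. This is only a sketch under assumptions: the texts are made up, and clean_txt is a name chosen here for illustration.

```r
library("quanteda")

# Hypothetical toy texts; replace with your own documents
txt <- c(doc1 = "OPEC may be forced to meet before June 25.",
         doc2 = "Texaco Canada said it lowered the contract price.")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_wordstem(tokens_tolower(toks))

# Collapse each document's tokens into a single string;
# the document names carry over as the names of the vector
clean_txt <- vapply(as.list(toks), paste, character(1), collapse = " ")

print(clean_txt)
```

The result is an ordinary character vector, so nothing about its class suggests that it still holds the original texts.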
