I think what you want to do is deliberately impossible in quanteda
.
You can, of course, do the cleaning quite easily without losing the order of words using the tokens*
set of functions:
library("tm")
data("crude")
library("quanteda")
toks <- corpus(crude) %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
tokens_wordstem()
print(toks, max_ndoc = 3)
#> Tokens consisting of 20 documents and 15 docvars.
#> reut-00001.xml :
#> [1] "Diamond" "Shamrock" "Corp" "said" "that" "effect"
#> [7] "today" "it" "had" "cut" "it" "contract"
#> [ ... and 78 more ]
#>
#> reut-00002.xml :
#> [1] "OPEC" "may" "be" "forc" "to" "meet" "befor"
#> [8] "a" "schedul" "June" "session" "to"
#> [ ... and 427 more ]
#>
#> reut-00004.xml :
#> [1] "Texaco" "Canada" "said" "it" "lower" "the"
#> [7] "contract" "price" "it" "will" "pay" "for"
#> [ ... and 40 more ]
#>
#> [ reached max_ndoc ... 17 more documents ]
But it is not possible to return this tokens
object into a corpus. Now it would be possible to write a new function to do this:
corpus.tokens <- function(x, ...) {
quanteda:::build_corpus(
unlist(lapply(x, paste, collapse = " ")),
docvars = cbind(quanteda:::make_docvars(length(x), docnames(x)), docvars(x))
)
}
corp <- corpus(toks)
print(corp, max_ndoc = 3)
But this object, while technically being a corpus
class object, is not what a corpus is supposed to be. From ?corpus
[emphasis added]:
Value
A corpus class object containing the original texts, document-level variables, document-level metadata, corpus-level
metadata, and default settings for subsequent processing of the
corpus.
The object above does not meet this description as the original texts have been processed already. Yet the class of the object communicates otherwise. I don't see a reason to break this logic as all subsequent analyses steps should be possible using either tokens*
or dfm*
functions.