I have a question about text pre-processing in the quanteda package for R. I want to generate a document-feature matrix from some documents, so I created a corpus and ran the following code.
data <- read.csv2("abstract.csv", stringsAsFactors = FALSE)
corpus <- corpus(data, docid_field = "docname", text_field = "documents")
dfm <- dfm(corpus, stem = TRUE, remove = stopwords('english'),
           remove_punct = TRUE, remove_numbers = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE)
When I examined the dfm, I noticed some tokens such as (#ml, @attribut, _iq, 0.01ms). I would rather have (ml, attribut, iq, ms).
I thought I deleted all the symbols and numbers. Why do I still get them?
I'd be glad to get some help.
Thanks!!!
The documentation for tokens says that, e.g., remove_numbers will remove tokens (words) that consist only of numbers, but not numbers that appear alongside other characters. If that is what you need, you might be better off stripping these numbers and other characters from your data beforehand, using something like the stringr package. - Andrew Gustar
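One way to follow that suggestion is to clean the raw text before building the corpus. The sketch below uses base R's gsub rather than stringr, and clean_text is a hypothetical helper name, not a quanteda or stringr function. It assumes you are happy to drop every digit and every non-letter character, which would also split a token such as "h2o" into "h" and "o".

```r
# Hypothetical helper (not part of quanteda): strip numbers and
# non-letter characters from raw text before tokenising.
clean_text <- function(x) {
  # Replace numbers, including decimals such as 0.01, with a space
  x <- gsub("[0-9]+([.,][0-9]+)*", " ", x)
  # Replace anything that is neither a letter nor whitespace (#, @, _, ...)
  x <- gsub("[^[:alpha:][:space:]]+", " ", x)
  # Collapse runs of whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

clean_text("#ml @attribute _iq 0.01ms")
# "ml attribute iq ms"
```

With the data frame from the question, this could be applied as data$documents <- clean_text(data$documents) before calling corpus(), so that stemming and stopword removal operate on the cleaned text.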