
I have a question regarding language pre-processing with the quanteda package in R. I want to generate a document-feature matrix from some documents, so I generated a corpus and ran the following code.

library(quanteda)

data <- read.csv2("abstract.csv", stringsAsFactors = FALSE)
corpus <- corpus(data, docid_field = "docname", text_field = "documents")
dfm <- dfm(corpus, stem = TRUE, remove = stopwords("english"),
           remove_punct = TRUE, remove_numbers = TRUE,
           remove_symbols = TRUE, remove_hyphens = TRUE)

When I examined the dfm, I noticed tokens such as #ml, @attribut, _iq, and 0.01ms. I would rather have ml, attribut, iq, and ms.

I thought I had removed all the symbols and numbers. Why do I still get them?

I'd be glad to get some help.

Thanks!!!

If you check the help for tokens, it says that, e.g., remove_numbers will remove tokens (words) that consist only of numbers, but not numbers that appear alongside other characters. You might be better off taking these numbers and other characters out of your data with something like the stringr package, if that is what you need. - Andrew Gustar
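Following that suggestion, here is a minimal stringr sketch (one of several possible regexes, not the only way) that strips everything except letters and whitespace in a single pass:

```r
library(stringr)

txt <- "one two, #ml @attribut _iq, 0.01ms."

# keep only letters (\p{L}) and whitespace; digits, #, @, _ and
# all punctuation are removed in one pass
str_remove_all(txt, "[^\\p{L}\\s]")
## [1] "one two ml attribut iq ms"
```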

1 Answer


For really fine-grained control, you will want to process the text yourself through pattern replacement. With stringi (or stringr) you can easily replace whole Unicode categories of symbols or punctuation.

Consider this example:

txt <- "one two, #ml @attribut _iq, 0.01ms."

quanteda::tokens(txt, remove_twitter = TRUE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"      "two"      "ml"       "attribut" "_iq"      "0.01ms"

That's an easy way to remove the special characters (such as # and @) that signal "Twitter" or other social-media conventions.

For more low-level control:

# how to remove the leading _ (just to demonstrate)
stringi::stri_replace_all_regex(txt, "(\\b)_(\\w+)", "$1$2")
## [1] "one two, #ml @attribut iq, 0.01ms."

# remove all digits
(txt <- stringi::stri_replace_all_regex(txt, "\\d", ""))
## [1] "one two, #ml @attribut _iq, .ms."
# remove all punctuation (\p{P}) and symbols (\p{S})
(txt <- stringi::stri_replace_all_regex(txt, "[\\p{P}\\p{S}]", ""))
## [1] "one two ml attribut iq ms"

quanteda::tokens(txt)
## tokens from 1 document.
## text1 :
## [1] "one"      "two"      "ml"       "attribut" "iq"       "ms"

Which is what you are aiming for, I am (partly) guessing.
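To tie this back to the pipeline in the question, the same replacement can be applied to the text column before the corpus()/dfm() steps. This is a sketch: the small data frame below is a hypothetical stand-in for the abstract.csv data, with the docname/documents columns from the question's read.csv2 call.

```r
library(stringi)

# hypothetical stand-in for the abstract.csv data from the question
data <- data.frame(
  docname   = "d1",
  documents = "one two, #ml @attribut _iq, 0.01ms.",
  stringsAsFactors = FALSE
)

# strip punctuation (\p{P}), symbols (\p{S}) and numbers (\p{N})
# in one pass, before building the corpus
data$documents <- stri_replace_all_regex(data$documents,
                                         "[\\p{P}\\p{S}\\p{N}]", "")
data$documents
## [1] "one two ml attribut iq ms"
```

After this, corpus(data, docid_field = "docname", text_field = "documents") followed by dfm() as in the question should no longer produce tokens like #ml, _iq, or 0.01ms.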