R: having trouble using quanteda corpus with readtext

Question

After reading my corpus with the Quanteda package, I get the same error when using various subsequent statements:

Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')"

For example, this happens with a simple statement such as texts(mycorpus)[2]. My actual goal is to create a dfm, which gives me the same error message as above.

I read the corpus with this code:

`mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt", 
docvarsfrom = "filenames", dvsep = "_", 
docvarnames = c("Date of Publication", "Length LexisNexis"), 
encoding = "UTF-8-BOM"))`

My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.

See screenshot.

Why am I getting this error every time? Thanks very much in advance for your help!

Response 1:

When using just readtext() I get one step further and texts(text.corpus)[1] does not yield an error.

However, when tokenizing, the same error occurs again, so:

token <- tokenize(text.corpus, removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
tokens(text.corpus)

Yields:

Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"

Response 2:

Now I get the two messages below in return. I also got them initially, which is why I started using corpus_frame():

Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"

In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")

Do I need to specify that 'tokenization' or any other step is only applied to the 'text' column and not to the entire dataset?

Response 3:

Thank you, Patrick, this clarifies things and got me somewhat further. When running this:

# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  corpus() %>%
  tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)

I get this:

Error in tokens_internal(texts(x), ...) : the ... list does not contain 3 elements

In addition: Warning message: removePunctremoveNumbers is deprecated; use remove_punctremove_numbers instead

So I changed it accordingly (using remove_punct and remove_numbers) and now the code runs well.

Alternatively, I also tried this:

# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
         docvarsfrom = "filenames", dvsep = "_", 
         docvarnames = c("Date of Publication", "Length LexisNexis", "source"), 
         encoding = "UTF-8-BOM")  %>%
  term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)

Which gives this error:

Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'

After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!

First, corpus_frame() is not a quanteda function, so I cannot help you there. Second, what is the result of just the readtext() operation? Does it return a readtext data.frame? If so, then quanteda::corpus() should work fine. — Ken Benoit
Once you get the data into a documents-by-term count matrix, you can fit a topic model by calling stm on the result (from package stm) — Patrick Perry

Patrick Perry · Accepted Answer · 2017-10-12T14:22:40

To clarify the situation:

Version 0.9.1 of the corpus package had a function called corpus. quanteda also has a function called corpus. To avoid the name clash between the two packages, the corpus package's corpus function was deprecated and renamed to corpus_frame in version 0.9.2; the old name was removed in version 0.9.3.

To avoid the name clash with quanteda, either upgrade corpus to the latest version on CRAN (0.9.3), or else do

library(corpus)
library(quanteda)

instead of the other order, so that quanteda is attached last and its corpus function masks the one from the corpus package.
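To see what the error in the question means, here is a minimal base-R sketch of an S3 dispatch failure of the same kind. The generic my_texts() and the class my_corpus are hypothetical stand-ins made up for illustration; they are not functions or classes from quanteda or corpus.

```r
# Hypothetical generic standing in for a function like quanteda's texts().
my_texts <- function(x, ...) UseMethod("my_texts")

# A method exists only for the (hypothetical) class "my_corpus".
my_texts.my_corpus <- function(x, ...) x$text

df <- data.frame(text = c("first article", "second article"),
                 stringsAsFactors = FALSE)

# An object with the class vector from the error message: there is no
# my_texts method for "corpus_frame" or "data.frame", so dispatch fails.
frame_obj <- structure(df, class = c("corpus_frame", "data.frame"))
res <- tryCatch(my_texts(frame_obj), error = function(e) conditionMessage(e))
print(res)

# The same data with the class the generic knows about dispatches fine.
corp_obj <- structure(df, class = c("my_corpus", "data.frame"))
out <- my_texts(corp_obj)
print(out)
```

In the question, the object was built by the corpus package's function but then handed to quanteda's generics, which have no method for that class. Calling quanteda::corpus() with the explicit prefix sidesteps the load-order issue altogether.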

Now, if you want to use quanteda to tokenize your texts, follow the advice given in Ken's answer:

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM") %>%
    corpus() %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, ngrams = 1:2)

You may want to use the dfm function instead of the tokens function if your goal is to get a document-by-term count matrix.
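As a rough sketch of the dfm route, the snippet below builds a small document-feature matrix from two made-up texts instead of the files on disk. It assumes a reasonably recent quanteda release, where, as far as I know, n-grams are built with tokens_ngrams() rather than through the ngrams argument shown above. The explicit quanteda:: prefix is used so the result does not depend on package load order.

```r
# Two made-up documents standing in for the newspaper articles.
txt <- c(doc1 = "The first article, from 2017.",
         doc2 = "The second article!")

# Build a quanteda corpus explicitly, regardless of which packages are attached.
corp <- quanteda::corpus(txt)

# Tokenize, dropping punctuation and numbers.
toks <- quanteda::tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)

# Add bigrams alongside the unigrams.
toks <- quanteda::tokens_ngrams(toks, n = 1:2)

# Document-feature (document-by-term) count matrix.
mat <- quanteda::dfm(toks)
print(mat)
```

With the real data, the quanteda::corpus() call would take the readtext() result instead of the character vector.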

If you want to use the corpus package, instead do

readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
     docvarsfrom = "filenames", dvsep = "_", 
     docvarnames = c("Date of Publication", "Length LexisNexis"), 
     encoding = "UTF-8-BOM") %>%
    term_matrix(drop_punct = TRUE, drop_number = TRUE, ngrams = 1:2)

Depending on what you're trying to do, you might want to use the term_stats function instead of the term_matrix function.
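For comparison, here is a small sketch of term_stats on two made-up texts rather than the real files. It assumes corpus version 0.9.3 or later and uses the corpus:: prefix so that it does not matter whether quanteda is attached. Note that the filter property is spelled drop_number, not drop_numbers.

```r
# Two made-up documents standing in for the newspaper articles.
txt <- c(doc1 = "The first article, from 2017.",
         doc2 = "The second article!")

# Term-level counts over unigrams and bigrams, with punctuation and
# numbers dropped by the text filter.
stats <- corpus::term_stats(txt, drop_punct = TRUE, drop_number = TRUE,
                            ngrams = 1:2)
print(stats)

# The matching documents-by-term count matrix, with the same filter settings.
mat <- corpus::term_matrix(txt, drop_punct = TRUE, drop_number = TRUE,
                           ngrams = 1:2)
print(dim(mat))
```

term_stats summarizes each term across the whole collection, while term_matrix keeps one row per document, which is what you need as input for a topic model.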
