After reading my corpus with the quanteda package, I get the same error from several subsequent statements:
Error in UseMethod("texts") : no applicable method for 'texts' applied to an object of class "c('corpus_frame', 'data.frame')"
For example, this simple statement triggers it: texts(mycorpus)[2]
My actual goal is to create a dfm (which gives me the same error message as above).
I read the corpus with this code:
`mycorpus <- corpus_frame(readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis"),
encoding = "UTF-8-BOM"))`
My dataset consists of 50 newspaper articles, including some metadata such as the date of publication.
Why am I getting this error every time? Thanks very much in advance for your help!
Response 1:
When using just readtext(), I get one step further: texts(text.corpus)[1] does not yield an error.
However, when tokenizing, the same error occurs again, so:
token <- tokenize(text.corpus, removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
tokens(text.corpus)
Yields:
Error in UseMethod("tokenize") : no applicable method for 'tokenize' applied to an object of class "c('readtext', 'data.frame')"
Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('readtext', 'data.frame')"
Response 2:
Now I get the following error and warning. I also got these initially, which is why I started using corpus_frame():
Error in UseMethod("tokens") : no applicable method for 'tokens' applied to an object of class "c('corpus_frame', 'data.frame')"
In addition: Warning message: 'corpus' is deprecated. Use 'corpus_frame' instead. See help("Deprecated")
Do I need to specify that 'tokenization' or any other step is only applied to the 'text' column and not to the entire dataset?
Response 3:
Thank you, Patrick, this clarifies things and brought me a step further. When running this:
# Quanteda - corpus way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
corpus() %>%
tokens(removePunct = TRUE, removeNumbers = TRUE, ngrams = 1:2)
I get this:
Error in tokens_internal(texts(x), ...) : the ... list does not contain 3 elements
In addition: Warning message: removePunctremoveNumbers is deprecated; use remove_punctremove_numbers instead
So I changed it accordingly (using remove_punct and remove_numbers) and now the code runs well.
Alternatively, I also tried this:
# Corpus - term_matrix way
readtext("C:/Users/renswilderom/Documents/Stuff Im working on at the moment/Newspaper articles DJ/test data/*.txt",
docvarsfrom = "filenames", dvsep = "_",
docvarnames = c("Date of Publication", "Length LexisNexis", "source"),
encoding = "UTF-8-BOM") %>%
term_matrix(drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2)
Which gives this error:
Error in term_matrix(., drop_punct = TRUE, drop_numbers = TRUE, ngrams = 1:2) : unrecognized text filter property: 'drop_numbers'
After removing drop_numbers = TRUE, the matrix is actually produced. Thanks very much for your help!
corpus_frame() is not a quanteda function, so I cannot help you there. Second, what is the result of just the readtext() operation? Does it return a readtext data.frame? If so, then quanteda::corpus() should work fine. – Ken Benoit
stm on the result (from package stm) – Patrick Perry
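The deprecation warning above ('corpus' is deprecated. Use 'corpus_frame' instead) comes from the separate corpus package, not from quanteda. A likely cause of the errors (an assumption, since the thread does not show which packages were attached) is that both packages were loaded and corpus() from the corpus package masked quanteda's constructor. A minimal sketch of the namespaced quanteda route follows. The `articles` data frame is invented stand-in data that mimics the shape of a readtext() result, and the separate tokens_ngrams() step assumes a recent quanteda release in which tokens() no longer takes an ngrams argument:

```r
# Minimal sketch with invented stand-in data (not the asker's articles).
# The data frame mimics what readtext() returns: a doc_id and a text column.
library(quanteda)

articles <- data.frame(
  doc_id = c("2017-01-05_512.txt", "2017-02-10_430.txt"),
  text   = c("Prices rose by 3 percent, analysts said.",
             "The festival drew 20000 visitors this year."),
  stringsAsFactors = FALSE
)

# Namespace the call so another package's corpus() cannot mask quanteda's.
mycorpus <- quanteda::corpus(articles, text_field = "text")
class(mycorpus)  # should include "corpus", not "corpus_frame"

# Tokenize with the snake_case argument names, then build 1- and 2-grams.
toks <- quanteda::tokens(mycorpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- quanteda::tokens_ngrams(toks, n = 1:2)

# Document-feature matrix, the asker's stated goal.
mydfm <- quanteda::dfm(toks)
```

Only the text column is tokenized here because text_field names it when the corpus is built; any other columns are carried along as document variables.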