Trouble accessing quanteda corpus quantities in version >= 2

Question

I am having a problem running a script I wrote a while ago. Back then, applying quanteda::corpus() to a readtext object returned an object of class "corpus" and "list". Now the same script returns an object of class "corpus" and "character", and this breaks the subsequent code. What could be the reason for this, and how can I solve it?

Here is the script:

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")

docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))

The files are international treaties. All the filenames follow the same format: they contain the name of the treaty and the year it was signed, and I was extracting both.

Back then, the class of corpus_txt was "corpus" "list":

> class(corpus_txt)
[1] "corpus" "list"

But now:

> class(corpus_txt)
[1] "corpus"    "character"
> packageVersion("quanteda")
[1] ‘2.1.2’

And I cannot extract information from the corpus the way I did before. Since I have been working on this since last October, I should have been using the same version all along.

Many thanks in advance.

Ken Benoit · Accepted Answer · 2021-01-12T08:16:43

We changed the internal structure of the corpus in v2, after two years of warnings in the documentation that users should not access the corpus internals directly, or their code would likely not work under future major versions.

From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:

quanteda 2.0 introduces some major changes, detailed here.

New corpus object structure.

The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.

From ?corpus:

For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
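To illustrate what "use the extractor and replacement functions" can look like in practice, here is a minimal sketch on a made-up two-document corpus. The document names, texts, and the "source" docvar are invented for illustration, and the sketch assumes quanteda >= 2.0 is installed.

```r
library(quanteda)

# Toy corpus; the names "doc1"/"doc2" and the texts are invented.
corp <- corpus(c(doc1 = "One sentence here.", doc2 = "Another text."))

class(corp)          # in quanteda >= 2.0 this is "corpus" "character"
ndoc(corp)           # number of documents
docnames(corp)       # document names, instead of reaching into corp$documents
as.character(corp)   # the texts as a plain character vector

# Document variables are read and written through docvars(),
# never through the object's attributes.
docvars(corp, "source") <- c("a", "b")
docvars(corp)
```

None of these calls depend on how the corpus is stored internally, which is why they should keep working across major versions.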

Solution? Use docnames(corpus_txt).
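Applied to the script in the question, one possible rewrite is sketched below. It is only an illustration under stated assumptions: the readtext() call is replaced by a small named character vector with invented file names, built so that the year sits at the positions the question's stri_sub(..., -9, -6) call expects. Note that after corpus_reshape(to = "sentences") the document names carry a sentence-number suffix, so this sketch sets the docvars from docnames() before reshaping and lets corpus_reshape() carry them over to each sentence.

```r
library(quanteda)
library(stringi)

# Hypothetical stand-in for the readtext() result. The file names are
# invented; the year occupies characters -9 to -6, as in the question.
txt <- c("treaty_a_1990x.txt" = "First article. Second article.",
         "treaty_b_2005x.txt" = "Only one article.")

corpus_txt <- corpus(txt)

# Use the docnames() accessor instead of corpus_txt$documents$`_document`,
# and do it before reshaping, while each document is still one file.
docvars(corpus_txt, "Treaty") <- docnames(corpus_txt)
docvars(corpus_txt, "Year") <- as.integer(stri_sub(docnames(corpus_txt), -9, -6))

# Each sentence inherits the docvars of the document it came from.
corpus_txt <- corpus_reshape(corpus_txt, to = "sentences")
docvars(corpus_txt)
```

With a real readtext object, corpus(txt) takes its document names from the file names, so the same two docvars() lines should apply unchanged, provided the offsets match your actual file names.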
