Quanteda group documents by multiple variables

Question

I would like to group the documents in my dfm by two variables, speaker and week_start. I was previously able to do this with dfm(corpus, groups = c("speaker", "week_start")), which grouped the documents by speaker-week.

However, since the recent updates to the quanteda package this no longer works. I now create the dfm first and then try to group it. Below is the code:

dfm <- dfm(corpus)
dfm <- dfm_group(dfm, groups = c(speaker, week_start))

However, when I do this I get the error:

Error: groups must have length ndoc(x)

I have also tried putting the docvar names in quotation marks, but this generates the same error.

Ken Benoit · Accepted Answer · 2021-05-25T15:02:52

We changed the usage of the groups argument in v3 to make it more standard.

From news(Version >= "3.0", package = "quanteda"):

We have added non-standard evaluation for by and groups arguments to access object docvars:

The *_sample() functions' argument by, and groups in the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.

Quoted docvar names no longer work, as these will be evaluated literally.

The by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate this functionality.

For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.

So, to get the previous behaviour, you would want to use:

groups = interaction(speaker, week_start)
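The error in the question arises because c(speaker, week_start) concatenates the two docvars into a single vector twice as long as the number of documents, whereas interaction() combines them element-wise into one value per document. A minimal base R sketch of that difference, using made-up speaker and week_start values (the data here are hypothetical, only the variable names come from the question):

```r
# Hypothetical docvars for four documents; the values are invented for illustration.
speaker    <- c("Smith", "Smith", "Jones", "Jones")
week_start <- c("2021-05-03", "2021-05-10", "2021-05-03", "2021-05-03")

# c() concatenates the vectors end to end: 8 values for 4 documents,
# which is why dfm_group() complains about the length of groups.
combined <- c(speaker, week_start)
length(combined)
## [1] 8

# interaction() pairs the values element-wise: one factor value per document.
grp <- interaction(speaker, week_start)
length(grp)
## [1] 4

# Each document now carries a single speaker-week label.
as.character(grp)
```

Documents 3 and 4 share the same label here, so they would be merged into one group.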

Here's an example:

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(
  c(
    "a b c",
    "a c d",
    "c d d",
    "d d e"
  ),
  docvars = data.frame(
    var1 = c("a", "a", "b", "b"),
    var2 = c(1, 2, 1, 1)
  )
)
corp %>%
  tokens() %>%
  dfm() %>%
  dfm_group(groups = interaction(var1, var2))
## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
##      features
## docs  a b c d e
##   a.1 1 1 1 0 0
##   b.1 0 0 1 4 1
##   a.2 1 0 1 1 0
