I am using R's quanteda package, with the latest versions of both R and the package. My corpus contains millions of documents.

Suppose I have a DFM generated by quanteda in which each document has a docvar holding its date. Thousands of documents are generated on any given day, and I want to aggregate the DFM by day, so that I have the total count of each term per day. I know that quanteda is built using data.table, so this should be possible, but I have found little in "Getting Started with Quanteda" or on StackOverflow that shows a clean way of doing it.

Any suggestions?

1 Answer

You want the 'groups' argument to dfm(), which sums the feature counts over all documents that share the same value of the named docvar:

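> # Add some random dates to an existing corpus: 19 distinct dates,
> # recycled so that every document gets one (several documents per date)
> docvars(data_corpus_inaugural)$date <- rep(as.Date(runif(19, 1, 18000), origin='1970-01-01'), length.out=ndoc(data_corpus_inaugural))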

> dfm_inaugural <- dfm(data_corpus_inaugural, groups='date')
> head(dfm_inaugural)
Document-feature matrix of: 19 documents, 9,215 features (80.8% sparse).
(showing first 6 documents and first 6 features)
            features
docs         fellow citizens  i appear before you
  1970-12-27      4        7 39      2     10  17
  1972-04-25      8       13 29      1      8   8
  1973-08-22      1        3 48      1      6   1
  1973-10-11      2        4 25      0      3   5
  1974-01-05      3        9 57      0      7   2
  1975-04-12      7       21 63      4      6  16
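
If you already have a DFM rather than a corpus, the same aggregation can be done after the fact with dfm_group(). As far as I know, recent quanteda releases also deprecate the 'groups' argument of dfm() in favour of this function, so it may be the safer route on a current install. The following is only a minimal sketch on a made-up three-document corpus: the texts, the document names and the object names (corp, dfmat, dfmat_by_day) are all hypothetical, and it assumes a quanteda version that provides tokens() and dfm_group().

```r
library(quanteda)

# Hypothetical toy corpus: two documents on one day, one on the next
txt <- c(d1 = "apple banana apple",
         d2 = "banana cherry",
         d3 = "apple cherry cherry")
corp <- corpus(
  txt,
  docvars = data.frame(date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")))
)

# Build the ordinary per-document DFM first
dfmat <- dfm(tokens(corp))

# Collapse it to one row per day by summing counts within each date.
# Passing a factor with one entry per document avoids relying on how a
# particular quanteda version looks up docvar names.
dfmat_by_day <- dfm_group(dfmat, groups = factor(docvars(dfmat, "date")))

dfmat_by_day
```

Each row of dfmat_by_day is then one day, and each cell is the total count of that term across all of that day's documents. Because the grouping operates on the sparse matrix, you only need to tokenize the corpus once, even if you later want to regroup by week or month.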