Building corpus in Quanteda while keeping track of the ID

Question

I have a dataset in which I have multiple texts per user. I want to build a corpus of all those documents with Quanteda but without losing the ability to link back the different texts to the corresponding user.

Here is some sample code that shows where I am stuck.

library(quanteda)

df <- data.frame('ID' = c(1, 1, 2), 'Text' = c('I ate apple', "I don't like fruits", "I swim in the dark"), stringsAsFactors = FALSE)
df_corpus <- corpus(df$Text, docnames = df$ID)
corpus_DFM <- dfm(df_corpus, tolower = TRUE, stem = FALSE)
print(corpus_DFM)

This results in

Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
     features
docs  i ate apple don't like fruits swim in the dark
  1   1   1     1     0    0      0    0  0   0    0
  1.1 1   0     0     1    1      1    0  0   0    0
  2   1   0     0     0    0      0    1  1   1    1
>

But I would like my document-feature matrix to look like this, with an id column next to the document names:


Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
       features
docs    id  i ate apple don't like fruits swim in the dark
  text1 1   1   1     1     0    0      0    0  0   0    0
  text2 1   1   0     0     1    1      1    0  0   0    0
  text3 2   1   0     0     0    0      0    1  1   1    1
>

Is there a way to automate this process using Quanteda? I would like to modify the docs column of the dfm object, but I do not know how to access it.

Any help would be welcome!

Thank you.

Ken Benoit · Accepted Answer · 2020-01-23T17:18:18

The issue here is that you are specifying the docnames as "ID", but document names have to be unique. This is why the corpus constructor function assigns 1, 1.1, 2 to your docnames based on the non-unique ID.

Solution? Let corpus() assign the docnames, and keep ID as a docvar (document variable). The easiest way to do this is to pass the whole data.frame to corpus(), which calls the data.frame method rather than the character method for corpus(). (See ?corpus.)

Change your code to be:

> df_corpus <- corpus(df, text_field =  "Text")
> corpus_DFM <- dfm(df_corpus, tolower = TRUE, stem = FALSE)
> print(corpus_DFM)
Document-feature matrix of: 3 documents, 10 features (60.0% sparse).
3 x 10 sparse Matrix of class "dfm"
       features
docs    i ate apple don't like fruits swim in the dark
  text1 1   1     1     0    0      0    0  0   0    0
  text2 1   0     0     1    1      1    0  0   0    0
  text3 1   0     0     0    0      0    1  1   1    1
> 
> docvars(corpus_DFM, "ID")
[1] 1 1 2

This enables you to easily recombine your dfm by user, if you want:

> dfm_group(corpus_DFM, groups = "ID")
Document-feature matrix of: 2 documents, 10 features (45.0% sparse).
2 x 10 sparse Matrix of class "dfm"
    features
docs i ate apple don't like fruits swim in the dark
   1 2   1     1     1    1      1    0  0   0    0
   2 1   0     0     0    0      0    1  1   1    1
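
For readers on a newer release, here is a minimal sketch of the same workflow. It assumes quanteda version 3 or later, where calling dfm() directly on a corpus and the stem argument are deprecated, so the corpus is tokenized first. It also shows one way to put the ID next to the counts in an ordinary data frame, which is close to the layout asked for in the question. The object names (corp, toks, dfmat, wide, by_user) are placeholders chosen for illustration and are not part of the original answer.

```r
library(quanteda)

df <- data.frame(ID = c(1, 1, 2),
                 Text = c("I ate apple", "I don't like fruits", "I swim in the dark"),
                 stringsAsFactors = FALSE)

# The data.frame method keeps every non-text column (here ID) as a docvar
corp <- corpus(df, text_field = "Text")

# Assumption: quanteda >= 3, so tokenize before building the dfm
toks <- tokens(corp)
dfmat <- dfm(toks, tolower = TRUE)

# Plain data frame with the user ID next to the per-document counts
wide <- cbind(ID = docvars(dfmat, "ID"),
              convert(dfmat, to = "data.frame"))

# Group by user, passing the ID vector itself as the grouping variable
by_user <- dfm_group(dfmat, groups = docvars(dfmat, "ID"))
```

Passing docvars(dfmat, "ID") as the grouping vector, rather than the quoted name "ID", should avoid depending on how a particular version interprets a character string in groups. The wide data frame is a regular R object, so it no longer carries the dfm methods; keep dfmat around if you still need them.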
