Feature extraction using Chi2 with Quanteda

Question

I have a data frame df with this structure:

Rank Review
5    good film
8    very good film
..

Then I created a document-feature matrix using the quanteda package:

mydfm <- dfm(df$Review, remove = stopwords("english"), stem = TRUE)

I would like to know how to calculate, for each feature (term), its chi-squared value with respect to the documents, so that I can extract the best features by chi-squared value.

Can you help me solve this problem, please?

EDIT :

> head(mydfm[, 5:10])
Document-feature matrix of: 63,023 documents, 6 features (92.3% sparse).
(showing first 6 documents and first 6 features)
       features
docs    bon accueil conseillèr efficac écout répond
  text1   0       0          0       0     0      0
  text2   1       1          1       1     1      1
  text3   0       0          0       0     0      0
  text4   0       0          0       0     0      0
  text5   0       0          1       0     0      0
  text6   0       0          0       0     1      0
  ...
  text60300 0     0          1       1     1      1

That is my dfm. From it I then create my tf-idf matrix:

tfidf <- tfidf(mydfm)[, 5:10]

I would like to determine the chi-squared value between these features and the documents (here I have 60300 documents):

textstat_keyness(mydfm, target = 2)

But since I have 60300 possible targets, I don't know how to do this automatically. I saw in the quanteda manual that the groups option of the dfm() function may solve this problem, but I don't see how to use it. :(

EDIT 2 :

Rank Review
10   always good
1    nice film
3    fine as usual

Here I try to group the documents by Rank with dfm():

 mydfm <- dfm(Review, remove = stopwords("english"), stem = TRUE, groups = Rank)

But it fails to group the documents.

Can you please help me solve this problem?

Thank you

Ken Benoit · Accepted Answer · 2017-06-01T15:45:49

See ?textstat_keyness. The default measure is chi-squared. You can change the target argument to compare a particular document's frequencies against the frequencies of all other documents, e.g.

textstat_keyness(mydfm, target = 1)

for the first document against the frequencies of all others, or

textstat_keyness(mydfm, target = 2)

for the second against all others, etc.
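To make the target-versus-rest comparison concrete, here is a minimal base-R sketch of a chi-squared test on a 2x2 table for a single feature. The counts are invented for illustration, and this is only the general idea: textstat_keyness() may apply a continuity correction and attaches a sign to the statistic, so its values need not match this exactly.

```r
# Hypothetical counts for one feature, invented for illustration only.
target_feature <- 30    # occurrences of the feature in the target document
target_other   <- 970   # occurrences of all other features in the target
rest_feature   <- 50    # occurrences of the feature in all other documents
rest_other     <- 8950  # occurrences of all other features elsewhere

# 2x2 contingency table: rows = target vs. rest, columns = feature vs. other
tab <- matrix(
  c(target_feature, target_other,
    rest_feature,   rest_other),
  nrow = 2, byrow = TRUE,
  dimnames = list(docs = c("target", "rest"),
                  terms = c("feature", "other"))
)

# Pearson chi-squared test without Yates' continuity correction
res <- chisq.test(tab, correct = FALSE)
chi2_value <- unname(res$statistic)
chi2_value
```

A large value suggests the feature is distributed very differently in the target than in the rest of the collection.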

If you want to compare categories of frequencies that group documents, you would need to use the groups = option in dfm(), either with a supplied variable or with one in the docvars. See the example in ?textstat_keyness.
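As a rough sketch of the grouping approach, the code below groups reviews by Rank and then computes keyness for one rank against all others. It assumes a recent quanteda (version 3 or later), where, as far as I know, grouping is done with dfm_group() rather than the groups = argument of dfm(), and where textstat_keyness() lives in the separate quanteda.textstats package. The toy data frame and the object names (corp, toks, dfmat, dfmat_by_rank) are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)  # assumed home of textstat_keyness() in quanteda >= 3

# Toy data in the shape described in the question (invented values)
df <- data.frame(
  Rank   = c(10, 1, 3, 10, 1),
  Review = c("always good", "nice film", "fine as usual",
             "very good film", "boring film"),
  stringsAsFactors = FALSE
)

# Build a corpus so that Rank is kept as a document variable
corp <- corpus(df, text_field = "Review")

# Tokenize, drop English stopwords, stem, then build the dfm
toks  <- tokens(corp, remove_punct = TRUE)
toks  <- tokens_remove(toks, stopwords("english"))
toks  <- tokens_wordstem(toks)
dfmat <- dfm(toks)

# Collapse documents into one row per Rank value
dfmat_by_rank <- dfm_group(dfmat, groups = docvars(dfmat, "Rank"))

# Chi-squared keyness of Rank 10 against all other ranks;
# after grouping, the document names are the group labels
key_rank10 <- textstat_keyness(dfmat_by_rank, target = "10", measure = "chi2")
head(key_rank10)
```

To cover every group rather than a single one, the same call can be repeated over docnames(dfmat_by_rank), for example with lapply(), which avoids setting the target by hand for each category.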
