I would like to manipulate (rename and combine) features in a dfm. How should I proceed?

The reason is as follows: I want to use a different stemming algorithm than the Porter stemmer implemented in quanteda (the kpss algorithm, called via Python).

Example: the three-word sentence c("creatief creatieve creatie") results in a dfm with three features (i.e. "creatief", "creatieve", "creatie"), each with a term frequency of 1. However, the kpss algorithm stems all three words to "creatie". It would be very handy if I could combine these three features in the dfm into a single feature called "creatie" with a term frequency of 3.

Your help is deeply appreciated.

(Note: I understand that such manipulations are possible after a dfm is converted into a 'simple' matrix, but I would like to do this within the dfm.)

Addendum: I overlooked the dfm_compress function. I am almost there... After I have compressed the dfm, is it also possible to apply a dictionary, e.g. so that the words 'creati' and 'innovati' are both counted as occurrences of the word category 'creati' (cf. the dictionary function in dfm)? (Note: given the huge volume of texts, I would rather not stem the raw data files.)

1 Answer

You can do this by creating a dfm and then stemming the features, and then recompiling the dfm to combine features made identical after the stemming.

library(quanteda)
txt <- c("creatief creatieve creatie")

(dfm1 <- dfm(txt))
## Document-feature matrix of: 1 document, 3 features (0% sparse).
## 1 x 3 sparse Matrix of class "dfmSparse"
##        features
## docs    creatief creatieve creatie
##   text1        1         1       1

The next step is one I have only approximated for your example: replace the string-subsetting call on the right-hand side below with your own stemming operation applied to the character vector of feature names.

# this approximates what you can do with the Python-based stemmer
# note that here you must use colnames<- since there is no function
# featnames<- (for replacement)
colnames(dfm1) <- stringi::stri_sub(featnames(dfm1), 1, 7)
dfm1
## Document-feature matrix of: 1 document, 3 features (0% sparse).
## 1 x 3 sparse Matrix of class "dfmSparse"
##        features
## docs    creatie creatie creatie
##   text1       1       1       1

Then you can recompile the dfm to combine the counts.

# this combines counts in featnames that are identical
dfm_compress(dfm1)
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfmSparse"
##        features
## docs    creatie
##   text1       3
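
The rename-and-compress steps above can be wrapped in a small helper so that any stemmer can be plugged in. This is only a sketch: restem_dfm() and my_stemmer() are hypothetical names introduced here, and my_stemmer() is just a stand-in for your Python-based kpss call (it must return one stem per input feature, in the same order). The dfm is built from tokens() so that the example also runs on recent quanteda versions.

```r
library(quanteda)

# Hypothetical stand-in for an external stemmer such as kpss:
# takes a character vector and returns a vector of stems of the same length.
my_stemmer <- function(x) substr(x, 1, 7)

# Hypothetical helper: rename the features with their stems, then merge
# the columns whose names have become identical.
restem_dfm <- function(x, stemmer) {
  stems <- stemmer(featnames(x))
  stopifnot(length(stems) == nfeat(x))  # one stem per feature
  colnames(x) <- stems
  dfm_compress(x, margin = "features")
}

dfm1 <- dfm(tokens("creatief creatieve creatie"))
dfm2 <- restem_dfm(dfm1, my_stemmer)
featnames(dfm2)
```

Because only the feature names are passed to the stemmer, each unique word is stemmed once, no matter how large the corpus is. If your quanteda version includes dfm_replace(), passing the original feature names as pattern and the stems as replacement should achieve the same result in a single call.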

Note that if you used quanteda's built-in stemmer, the renaming and compressing could be done in a single step with dfm_wordstem():

dfm_wordstem(dfm1)
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfmSparse"
##        features
## docs    creati
##   text1      3
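
On the addendum about dictionaries: one possible approach, sketched here under the assumption that glob patterns such as "creati*" describe your categories well enough, is to apply a dictionary to the dfm with dfm_lookup(). The example text and the dictionary key are illustrative only, not taken from your data.

```r
library(quanteda)

# Illustrative text: three words match the category, one does not
dfm1 <- dfm(tokens("creatie innovatie creatief huis"))

# Both glob patterns are counted under the single key "creati"
dict <- dictionary(list(creati = c("creati*", "innovati*")))

# Features matching a dictionary value are summed under its key;
# features that match nothing (here "huis") are dropped
dfm_cat <- dfm_lookup(dfm1, dict)
dfm_cat
```

Since the lookup works on the dfm itself, the raw text files do not need to be stemmed or re-read. The same call can be applied to an already compressed dfm.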