I'm working with the quanteda package in R, and I'd like to compute n-grams over a set of stemmed words to get a quick-and-dirty estimate of which content words tend to appear near each other. If I try:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
twitter.semantic <- twitter.docs %>%
  dfm(removeTwitter = TRUE, ignoredFeatures = stopwords("english"),
      ngrams = 2, skip = 0:3, stem = TRUE) %>%
  trim(minCount = 50, minDoc = 2)
This only stems the final word of each bigram. However, if I try to stem first:
twitter.files <- textfile(files)
twitter.docs <- corpus(twitter.files)
stemmed_no_stops <- twitter.docs %>%
  toLower %>%
  tokenize(removePunct = TRUE, removeTwitter = TRUE) %>%
  removeFeatures(stopwords("english")) %>%
  wordstem
twitter.semantic <- stemmed_no_stops %>%
  skipgrams(n = 2, skip = 0:2) %>%
  dfm %>%
  trim(minCount = 25, minDoc = 2)
Then quanteda doesn't know how to work with the stemmed list, and I get the error:
assignment of an object of class “NULL” is not valid for @‘ngrams’
in an object of class “dfmSparse”; is(value, "integer") is not TRUE
Is there an intermediate step I can do to use a dfm on the stemmed words, or to tell dfm to stem first and do ngrams second?
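To make the goal concrete, here is a minimal sketch of the order I'm after (stem first, then form bigrams), done outside quanteda on a toy token vector. It assumes the SnowballC package is installed. The token vector and variable names are made up for illustration, and it only forms adjacent bigrams, not skip-grams.

```r
# Illustration only: stem first, then build bigrams from the stems.
# Assumes the SnowballC package is installed; `toks` is a made-up example.
library(SnowballC)

toks <- c("running", "dogs", "chased", "cats")

# Step 1: stem every token
stems <- wordStem(toks, language = "english")

# Step 2: pair each stem with its successor to form adjacent bigrams
bigrams <- paste(head(stems, -1), tail(stems, -1), sep = "_")

print(bigrams)
```

What I want from quanteda is the equivalent of this, where every word in each bigram is stemmed, not only the last one.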
stem and ngrams play nicely together. – Michael Anderson