I've been using R's tm package with a good deal of success on classification problems. I know how to find the most frequent terms across the entire corpus with findFreqTerms(), but I don't see anything in the documentation that returns the most frequent term in each individual document of the corpus (after I've stemmed and removed stopwords, but before I remove sparse terms). I've tried using apply() with max, but that gives me the maximum count of any term in each document, not the name of the term itself.
library(tm)
data("crude")
corpus <- tm_map(crude, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# max over each row gives the highest count per document, not the term name
maxterms <- apply(dtm, 1, max)
maxterms
127 144 191 194 211 236 237 242 246 248 273 349 352
5 13 2 3 3 10 8 3 7 9 9 4 5
353 368 489 502 543 704 708
4 4 4 5 5 9 4
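For reference, the direction I've been poking at is swapping which.max in for max so I can index back into the term names. This is just a sketch (it coerces the DocumentTermMatrix to a dense matrix, so it assumes the corpus is small enough for that, and ties go to whichever term comes first), and I'm not sure it's the idiomatic tm way:

m <- as.matrix(dtm)                         # dense matrix: rows = documents, columns = terms
top <- colnames(m)[apply(m, 1, which.max)]  # term with the highest count in each document
names(top) <- rownames(m)                   # label each term with its document ID
top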
Thoughts?