0 votes

Assume we have a tf-idf weighted dfm from a corpus of 10K rather small documents.

What's the quanteda way of extracting the top feature, i.e., max tf-idf values by document? I do want the entire corpus to be the reference when computing tf-idf. Something along the lines of

topfeatures(some_dfm_tf_idf, n = 3, decreasing = TRUE, groups = "id")

returns an appropriate list. Yet it takes quite some time for something that should basically be sorted out already at this point. Given that quanteda has performed so well in everything I have done so far, I suspect I might be doing something wrong here.

Maybe this is somewhat related to this discussion on github (https://github.com/quanteda/quanteda/issues/1646) and the example workaround that @Astelix shows.
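For reference, a minimal sketch of the kind of setup assumed here (my_corpus and the "id" docvar are placeholders, not my actual data): the tf-idf weighting is applied to the dfm built from the entire corpus first, and only then are the top features pulled out per document.

library("quanteda")

# placeholder corpus with an "id" docvar; weight the full-corpus dfm first
# so that the entire corpus is the reference for the idf term
some_dfm_tf_idf <- dfm(my_corpus) %>%
  dfm_tfidf()

topfeatures(some_dfm_tf_idf, n = 3, decreasing = TRUE, groups = "id")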


3 Answers

2 votes

topfeatures() is somewhat slow because it sorts all of the features and then returns the top value(s). A more efficient way to get just the top-valued feature in each document is to use max.col(). Here's the method and a comparison (putting the return in a list of the same format as the topfeatures() answer).

library("quanteda")
## Package version: 1.5.2

data(data_corpus_sotu, package = "quanteda.corpora")
dfmat <- dfm(data_corpus_sotu) %>%
  dfm_tfidf()

# alternative using max.col
get_top_feature <- function(x) {
  # index of the column (feature) holding the largest value in each row
  topfeature_index <- max.col(x, "first")
  # extract each document's top value and label it with its feature name
  result <- mapply(function(a, b) {
    l <- as.numeric(x[a, b])
    names(l) <- featnames(x)[b]
    l
  },
  seq_len(ndoc(x)), topfeature_index,
  SIMPLIFY = FALSE
  )
  names(result) <- docnames(x)
  result
}

microbenchmark::microbenchmark(
  topfeatures = topfeatures(dfmat, n = 1, groups = docnames(dfmat)),
  maxcol = get_top_feature(dfmat),
  times = 20, unit = "relative"
)
## Unit: relative
##         expr      min       lq     mean   median       uq      max neval
##  topfeatures 2.085184 2.113136 2.069444 2.104166 2.032536 1.987218    20
##       maxcol 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
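
A quick usage sketch (the list has the same shape as the topfeatures() result; the actual values depend on the corpus):

# one top-weighted feature per document, as a named list
top_by_doc <- get_top_feature(dfmat)
head(top_by_doc, 2)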
1 vote

topfeatures() is exactly the way to go. I'm not sure why you say it "takes quite some time", or what your "id" docvar is, but the following is the correct and most efficient way to get a list of the top-scored features in your dfm (regardless of the weighting).

The result is a named list where the names are the docnames, and each element is a named numeric vector where the element name is the feature label.

library("quanteda")
## Package version: 1.5.2

some_dfm_tf_idf <- dfm(data_corpus_irishbudget2010)[1:5, ] %>%
  dfm_tfidf()

topfeatures(some_dfm_tf_idf, n = 1, groups = docnames(some_dfm_tf_idf))
## $`Lenihan, Brian (FF)`
## details 
## 5.57116 
## 
## $`Bruton, Richard (FG)`
## confront 
##  5.59176 
## 
## $`Burton, Joan (LAB)`
## lenihan 
## 4.19382 
## 
## $`Morgan, Arthur (SF)`
##    sinn 
## 5.59176 
## 
## $`Cowen, Brian (FF)`
## dividend 
##  4.19382
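
For the n = 3 case in your example call, the same pattern applies; each list element then holds that document's three highest-weighted features (output omitted here):

topfeatures(some_dfm_tf_idf, n = 3, groups = docnames(some_dfm_tf_idf))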
0 votes

In addition to Ken's get_top_feature(), you might be interested not only in the "top" terms but also in the terms with the second-highest weighting in a tf-idf dfm. It took me some time to figure this out, so I thought it might be helpful in general.

get_scnd_feature <- function(x) {
  # index of each document's top-weighted feature
  topfeature_index <- max.col(x, "first")
  # mask the top feature with -Inf, then take the maximum again to find the
  # second-highest weighted feature in each document
  scndfeature_index <- max.col(replace(x, cbind(seq_len(nrow(x)), topfeature_index), -Inf), "first")
  result <- mapply(function(a, b) {
    l <- as.numeric(x[a, b])
    names(l) <- featnames(x)[b]
    l
  },
  seq_len(ndoc(x)), scndfeature_index,
  SIMPLIFY = FALSE
  )
  names(result) <- docnames(x)
  result
}
scndterm_tfidf <- get_scnd_feature(dtmtfidf)
scndterm_tfidf <- get_scnd_feature(dtmtfidf)

You can check the result by comparing the weights:

maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
scndtermcount <- apply(dtmtfidf, 1, function(x) x[maxn(2)(x)])
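
For instance, a sketch of that comparison (assuming dtmtfidf is small enough for apply() to work on it):

# the second-highest weights from both approaches should agree
all.equal(unname(unlist(scndterm_tfidf)), unname(scndtermcount))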