0 votes

Assume we have a tf-idf weighted dfm from a corpus of 10K rather small documents.

What's the quanteda way of extracting the top feature, i.e., max tf-idf values by document? I do want the entire corpus to be the reference when computing tf-idf. Something along the lines of

topfeatures(some_dfm_tf_idf, n = 3, decreasing = TRUE, groups = "id")

returns an appropriate list. Yet it takes quite some time for something that should basically be sorted out already at this point. Given that quanteda has performed so well in everything I have done so far, I suspect I might be doing something wrong here.

Maybe this is somewhat related to this discussion on github (https://github.com/quanteda/quanteda/issues/1646) and the example workaround that @Astelix shows.
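For reference, a minimal sketch of the kind of setup assumed here (my_corpus and the "id" docvar are placeholders, not my actual data): the tf-idf weighting is applied to the dfm built from the entire corpus first, and only then are the top features pulled out per document.

library("quanteda")

# placeholder corpus with an "id" docvar; weight the full-corpus dfm first
# so that the entire corpus is the reference for the idf term
some_dfm_tf_idf <- dfm(my_corpus) %>%
  dfm_tfidf()

topfeatures(some_dfm_tf_idf, n = 3, decreasing = TRUE, groups = "id")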


3 Answers

2 votes

topfeatures() is somewhat slow because it sorts all of the features and then returns the top value(s). A more efficient way to get just the top-valued feature in each document is to use max.col(). Here's the method and a comparison (putting the return in a list of the same format as the topfeatures() answer).

library("quanteda")
## Package version: 1.5.2

data(data_corpus_sotu, package = "quanteda.corpora")
dfmat <- dfm(data_corpus_sotu) %>%
  dfm_tfidf()

# alternative using max.col
get_top_feature <- function(x) {
  # index of the column (feature) holding the largest value in each row
  topfeature_index <- max.col(x, "first")
  # extract each document's top value and label it with its feature name
  result <- mapply(function(a, b) {
    l <- as.numeric(x[a, b])
    names(l) <- featnames(x)[b]
    l
  },
  seq_len(ndoc(x)), topfeature_index,
  SIMPLIFY = FALSE
  )
  names(result) <- docnames(x)
  result
}

microbenchmark::microbenchmark(
  topfeatures = topfeatures(dfmat, n = 1, groups = docnames(dfmat)),
  maxcol = get_top_feature(dfmat),
  times = 20, unit = "relative"
)
## Unit: relative
##         expr      min       lq     mean   median       uq      max neval
##  topfeatures 2.085184 2.113136 2.069444 2.104166 2.032536 1.987218    20
##       maxcol 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
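
A quick usage sketch (the list has the same shape as the topfeatures() result; the actual values depend on the corpus):

# one top-weighted feature per document, as a named list
top_by_doc <- get_top_feature(dfmat)
head(top_by_doc, 2)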
1 vote

topfeatures() is exactly the way to go. I'm not sure why you say it "takes quite some time", or what your "id" docvar is, but the following is the correct and most efficient way to get a list of the top-scored features in your dfm (regardless of the weighting).

The result is a named list where the names are the docnames, and each element is a named numeric vector where the element name is the feature label.

library("quanteda")
## Package version: 1.5.2

some_dfm_tf_idf <- dfm(data_corpus_irishbudget2010)[1:5, ] %>%
  dfm_tfidf()

topfeatures(some_dfm_tf_idf, n = 1, groups = docnames(some_dfm_tf_idf))
## $`Lenihan, Brian (FF)`
## details 
## 5.57116 
## 
## $`Bruton, Richard (FG)`
## confront 
##  5.59176 
## 
## $`Burton, Joan (LAB)`
## lenihan 
## 4.19382 
## 
## $`Morgan, Arthur (SF)`
##    sinn 
## 5.59176 
## 
## $`Cowen, Brian (FF)`
## dividend 
##  4.19382
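
For the n = 3 case in your example call, the same pattern applies; each list element then holds that document's three highest-weighted features (output omitted here):

topfeatures(some_dfm_tf_idf, n = 3, groups = docnames(some_dfm_tf_idf))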
0 votes

In addition to Ken's get_top_feature(), you might be interested not only in the "top" terms but also in the terms with the second-highest weighting in a tf-idf dfm. It took me some time to figure this out, so I thought it might be helpful in general.

get_scnd_feature <- function(x) {
  # index of each document's top-weighted feature
  topfeature_index <- max.col(x, "first")
  # mask the top feature with -Inf, then take the maximum again to find the
  # second-highest weighted feature in each document
  scndfeature_index <- max.col(replace(x, cbind(seq_len(nrow(x)), topfeature_index), -Inf), "first")
  result <- mapply(function(a, b) {
    l <- as.numeric(x[a, b])
    names(l) <- featnames(x)[b]
    l
  },
  seq_len(ndoc(x)), scndfeature_index,
  SIMPLIFY = FALSE
  )
  names(result) <- docnames(x)
  result
}
scndterm_tfidf <- get_scnd_feature(dtmtfidf)
scndterm_tfidf <- get_scnd_feature(dtmtfidf)

You can check the result by comparing the weights:

maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
scndtermcount <- apply(dtmtfidf, 1, function(x) x[maxn(2)(x)])
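
For instance, a sketch of that comparison (assuming dtmtfidf is small enough for apply() to work on it):

# the second-highest weights from both approaches should agree
all.equal(unname(unlist(scndterm_tfidf)), unname(scndtermcount))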