How to get the percentage of documents that contain a feature(s)

Question

I'm using this solution (get what percent of documents contain a feature - quanteda) to find the number of documents that contain any one of a group of features in my dataset. If a document contains at least one of the words, I want it to return TRUE.

I got it to work, but only some of the time, and I can't figure out why. Removing or adding words sometimes fixes it and sometimes doesn't. This is the code I used (the compound phrases were already joined with tokens_compound() before the dfm was created):

thetarget <- c("testing", "test", "example words", "example")

df <- data.frame(docname = docnames(dfm),
                 Year = docvars(dfm, c("Year")),
                 contains_target = rowSums(dfm[, thetarget]) > 0,
                 row.names = NULL)

And this is the error I sometimes get:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'rowSums': 
Subscript out of bounds

TIA

edit (script to create table showing a year and number of documents containing any of the target words):

df2 <- df %>%
  mutate_if(is.logical, as.character) %>%
  filter(!str_detect(contains_target, "FALSE")) %>%
  group_by(Year) %>%
  summarise(n = n())

The problem is when dfm[, thetarget] is not defined; I don't understand the packages you're using, but does that help? — walter

Ken Benoit · Accepted Answer · 2021-10-11T06:20:26

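You are getting the error because some of the dfm objects you create do not contain every feature listed in thetarget. Indexing with dfm[, thetarget] fails with "Subscript out of bounds" as soon as one of those names is missing from the dfm, which is why adding or removing words changes whether the code runs.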

Here's a way to avoid that, using docfreq():

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

dfmat <- tokens(data_corpus_inaugural) %>%
  tokens_select(thetarget) %>%
  dfm()

docfreq(dfmat) / ndoc(dfmat)
##    economy   congress    nuclear 
## 0.52542373 0.49152542 0.08474576
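
docfreq() gives the share per feature. If what you want is the share of documents that contain at least one of the target features, a minimal sketch along the same lines could look like this (share_any is just an illustrative name, and the dfm is rebuilt here so the snippet runs on its own):

```r
library("quanteda")

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

# keep only the target tokens; targets that never occur simply drop out
dfmat <- tokens(data_corpus_inaugural) %>%
  tokens_select(thetarget) %>%
  dfm()

# TRUE for each document with at least one target feature;
# the mean of a logical vector is the proportion of TRUE values
share_any <- mean(rowSums(dfmat) > 0)

share_any        # proportion of documents
100 * share_any  # the same value as a percentage
```

This value can never be smaller than the largest per-feature share from docfreq(), because every document counted for a single feature is also counted here.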

To get the data.frame in the question:

df <- data.frame(
  docname = docnames(dfmat),
  Year = docvars(dfmat, c("Year")),
  contains_target = as.logical(rowSums(dfmat)),
  row.names = NULL
)

head(df)
##           docname Year contains_target
## 1 1789-Washington 1789            TRUE
## 2 1793-Washington 1793           FALSE
## 3      1797-Adams 1797            TRUE
## 4  1801-Jefferson 1801            TRUE
## 5  1805-Jefferson 1805           FALSE
## 6    1809-Madison 1809            TRUE
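
If you would rather keep the full dfm and index it as in the question, another option is to drop the missing names before subsetting. This is only a sketch under that assumption: dfmat_full and present are illustrative names, and intersect() with featnames() is one possible guard, not the only one.

```r
library("quanteda")

thetarget <- c("nuclear", "congress", "economy", "_not_a_feature_")

# full dfm, without selecting tokens first
dfmat_full <- dfm(tokens(data_corpus_inaugural))

# keep only the targets that actually exist as features,
# so the column subscript can no longer be out of bounds
present <- intersect(thetarget, featnames(dfmat_full))

# fall back to all FALSE if none of the targets occur at all
contains_target <- if (length(present) > 0) {
  rowSums(dfmat_full[, present]) > 0
} else {
  rep(FALSE, ndoc(dfmat_full))
}

df <- data.frame(
  docname = docnames(dfmat_full),
  Year = docvars(dfmat_full, "Year"),
  contains_target = unname(contains_target),
  row.names = NULL
)

# percentage of documents containing at least one target
100 * mean(df$contains_target)
```

The rest of the question's code should then work unchanged on df, since it has the same three columns.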
