get what percent of documents contain a feature - quanteda

Question

I'm trying to find out what % of documents contain a feature using quanteda. I know dfm_weight() is available, but I believe the 'prop' scheme computes feature frequency within a document, not across documents.

My goal would be to avoid having to do the ifelse statement and keep it all in quanteda, but I'm not sure this is possible. The output I'm looking for is a side-by-side bar chart grouped by year that has features along the y-axis and % occurrence in documents along the x. The interpretation here would then be "In 20% of all comments in 2018, people mention the word X, compared to 24% in 2019."

library(quanteda)
library(reshape2)
library(dplyr)

df$rownum = 1:nrow(df) # unique ID
# pass df once: piping df into corpus(df, ...) would supply it twice
dfCorp19 = corpus(df, text_field = 'WhatPromptedYourSearch', docid_field = 'rownum')

x = dfm(dfCorp19,
        remove=c(stopwords(), toRemove),
        remove_numbers = TRUE,
        remove_punct = TRUE) %>%
    textstat_frequency(groups ='year') 

x = x %>% group_by(group) %>% mutate(prop = ifelse(group=='2019', docfreq/802, docfreq/930))
x = dcast(x,feature ~ group, value.var='prop')

What are the two "sides" of the bars? If it's just % mentioning, then no point putting % not mentioning. — Ken Benoit
Sorry I was grouping by year so each bar for a word would be a year. So “ in 2018, Apple was mentioned 20% vs in 2019 it was mentioned 25%” — Ted Mosby
where is df coming from? could you include a link (maybe Dropbox/GoogleDocs)? — Nate

Ken Benoit · Accepted Answer · 2019-10-21T15:45:39

Here's an attempt using some demo data, where the group is decade.

library("quanteda")
#> Package version: 1.5.1

docvars(data_corpus_inaugural, "decade") <-
    floor(docvars(data_corpus_inaugural, "Year") / 10) * 10

dfmat <- dfm(corpus_subset(data_corpus_inaugural, decade >= 1970))

target_word <- "nuclear"

Now we can extract a data.frame for the target feature. Note the rowSums() call: it is needed because any slice of a dfm is still a dfm, not a vector.

df <- data.frame(docname = docnames(dfmat),
                 decade = docvars(dfmat, "decade"),
                 # TRUE when the document contains the target word at least once
                 contains_target = rowSums(dfmat[, target_word]) > 0,
                 row.names = NULL)
df
#>         docname decade contains_target
#> 1    1973-Nixon   1970            TRUE
#> 2   1977-Carter   1970            TRUE
#> 3   1981-Reagan   1980           FALSE
#> 4   1985-Reagan   1980            TRUE
#> 5     1989-Bush   1980           FALSE
#> 6  1993-Clinton   1990           FALSE
#> 7  1997-Clinton   1990            TRUE
#> 8     2001-Bush   2000           FALSE
#> 9     2005-Bush   2000           FALSE
#> 10   2009-Obama   2000            TRUE
#> 11   2013-Obama   2010           FALSE
#> 12   2017-Trump   2010           FALSE
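
To make the rowSums() point concrete, here is a minimal sketch on toy data. The two documents and the object names (dfmat_toy, slice, counts, contains) are made up for illustration and are not part of the inaugural corpus example.

```r
library("quanteda")

# Two toy documents (hypothetical data, for illustration only)
dfmat_toy <- dfm(tokens(c(d1 = "a c", d2 = "b b c")))

# Selecting one column still returns a dfm, not a plain vector
slice <- dfmat_toy[, "b"]
is.dfm(slice)

# rowSums() collapses the one-column dfm to a named numeric vector,
# which can then be compared against 0 to get one logical per document
counts <- rowSums(slice)
contains <- counts > 0
contains
```

Here d1 has no "b" and d2 has two, so contains is FALSE for d1 and TRUE for d2.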

With that, it's a simple matter to summarize proportions and plot them using some dplyr and ggplot2.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
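df2 <- df %>%
    group_by(decade) %>%
    # the mean of a logical vector is the share of TRUE values, i.e. the
    # proportion of documents in each decade that contain the target word
    summarise(freq = mean(contains_target))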

library("ggplot2")
g <- ggplot(df2, aes(y = freq, x = decade)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    xlab("") + ylab("Proportion of documents containing target word")
g

Created on 2019-10-21 by the reprex package (v0.3.0)
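
If you want the same proportion for every feature at once, without hard-coding the number of documents per year, one possible sketch is below. It is not part of the reprex above: the four toy documents, the year vector, and the object names (present, counts, n_docs, prop) are assumptions for illustration. It also converts the dfm to a dense matrix, which is fine for a small dfm but may be too memory-hungry for a large one.

```r
library("quanteda")

# Toy documents and their years (hypothetical data, for illustration only)
txt <- c(d1 = "apple banana", d2 = "apple",
         d3 = "banana cherry", d4 = "cherry cherry")
year <- c(2018, 2018, 2019, 2019)

dfmat_toy <- dfm(tokens(txt))

# TRUE where a document contains the feature at least once
present <- as.matrix(dfmat_toy) > 0

# Number of documents per year that contain each feature (base R rowsum)
counts <- rowsum(present * 1, group = year)

# Number of documents per year, aligned to the row order of counts
n_docs <- table(year)
prop <- counts / as.vector(n_docs[rownames(counts)])
prop
```

The result is a year-by-feature matrix of proportions. With these toy documents, "apple" is in both 2018 documents and neither 2019 document, while "banana" is in one of the two documents in each year. From there you can reshape it (for example with reshape2::melt()) and plot one bar per year for each feature.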
