
I'm a beginner using R and quanteda and I can't solve the following issue, even after having read similar threads.

I have a dataset imported from Stata in which the column "text" contains tweets from different groups of people, identified by the variable "group". I want to count occurrences of words identified by my dictionary at the group level.

Here is a reproducible example:

dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813", 
"174457180812_10156824136360813", "174457180812_10156823535820813", 
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue", 
"We should care about the environment", "Let's rethink environmental policies", 
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000", 
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1", 
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame"))

First I create my dictionary:

climatechange_dict <- dictionary(list(
  climate = c(
    "environment*",
    "climate change")))

Then I specify the corpus

climate_corpus <- corpus(tweets$tweet_message)

I create a dfm for each group:

group1_dfm <- dfm(corpus_subset(climate_corpus, tweets$group == "1"))

And then I try to calculate the frequency of the dictionary words for each group:

group1_climate <- dfm_lookup(group1_dfm, dictionary = climatechange_dict)
group1 <- subset(tweets, tweets$group == "1")
group1$climatescore <- as.numeric(group1_climate[,1])

group1$climate <- "normal"
group1$climate[group1$climatescore > 0] <- "climate"
table(group1$climate)

My problem is that this way, multiword dictionary entries such as "climate change" are not counted. I have read online that I need to apply tokens_lookup() to the tokens and then construct the dfm, but I don't know how to do that in this case. I would be really grateful if you could help me with this. Many thanks!

Could you edit your post to add some sample or simulated corpus text data for us to use in troubleshooting this problem? That would make it easier for someone to help you. – xilliam
Sorry about that, I've edited my post. – giovanni

1 Answer


It's hard to make sure that this will work since you don't supply a reproducible example, but try this:

library(quanteda)

climate_corpus <- corpus(tweets, text_field = "tweet_message")

climatechange_dict <- 
    dictionary(list(climate = c("environment*", "climate change")))

groupeddfm <- tokens(climate_corpus) %>%
    tokens_lookup(dictionary = climatechange_dict) %>%
    dfm() %>%
    dfm_group(groups = group)  # quanteda >= 3; in older versions: dfm(groups = "group")

This does the following:

  • creates a corpus from your tweets data.frame and adds the other variables as docvars. (If you know which is a unique document identifier, you could specify that column too using docid_field = "<yourdocidentifier>".)

  • Does the dictionary "lookup" operation on the tokens, which means you will pick up the phrases like "climate change". This is not happening with dfm_lookup() because dfm() converts the tokens into "features" which have no record of order any more, and so cannot recover phrases.

  • Consolidates the documents into groups according to the group column of tweets. This obviates the need for any manual grouping using subsets. (I think this is what you wanted, right?)
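If you also want the per-tweet scores from your original subset approach (rather than, or in addition to, group totals), you can skip the grouping step and take the counts straight from the ungrouped dfm. A minimal sketch under the same assumptions (quanteda installed, `tweets` as in your dput() sample):

```r
library(quanteda)

climate_corpus <- corpus(tweets, text_field = "tweet_message")

# lookup on tokens first, so "climate change" is matched as a phrase;
# no dfm_group(), so we keep one row per tweet
doc_dfm <- dfm(tokens_lookup(tokens(climate_corpus),
                             dictionary = climatechange_dict))

tweets$climatescore <- as.numeric(doc_dfm[, "climate"])
tweets$climate <- ifelse(tweets$climatescore > 0, "climate", "normal")
table(tweets$group, tweets$climate)
```

This reproduces your group1/group2/... tables in one pass, without manually subsetting by group.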

The resulting dfm will be ngroups x 1, where 1 is the single key for your dictionary. You can easily coerce this to a data.frame or other format using convert().
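For example, to pull the group-level counts into a regular data.frame (a sketch; the `climate` column name comes from your dictionary key, and `flag` is just an illustrative name):

```r
library(quanteda)

# one row per group, one column per dictionary key
climate_counts <- convert(groupeddfm, to = "data.frame")

# recreate your "climate"/"normal" labelling at the group level
climate_counts$flag <- ifelse(climate_counts$climate > 0, "climate", "normal")
climate_counts
```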