I'm a beginner with R and quanteda and I can't solve the following issue, even after reading similar threads.
I have a dataset imported from Stata in which the column "tweet_message" contains tweets from different groups of people identified by the variable "group". I want to count occurrences of the words in my dictionary at the group level, in the way shown below.
Here is a reproducible example:
dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1",
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
First I create my dictionary:
climatechange_dict <- dictionary(list(
  climate = c("environment*",
              "climate change")
))
Then I create the corpus:
climate_corpus <- corpus(tweets$tweet_message)
I create a dfm for each group (shown here for group 1):
group1_dfm <- dfm(corpus_subset(climate_corpus, tweets$group == "1"))
And then I try to calculate the frequency of the dictionary words for each group:
group1_climate <- dfm_lookup(group1_dfm, dictionary = climatechange_dict)
group1 <- subset(tweets, tweets$group == "1")
group1$climatescore <- as.numeric(group1_climate[,1])
group1$climate <- "normal"
group1$climate[group1$climatescore > 0] <- "climate"
table(group1$climate)
My problem is that multiword dictionary entries such as "climate change" are not counted this way. I have read online that I need to apply tokens_lookup() to the tokens first and only then construct the dfm, but I don't know how to do that in this case. My rough attempt is sketched below; I would be really grateful if you could tell me whether it is on the right track. Many thanks!
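Here is my guess at the tokens_lookup() approach. I am only assuming that tokens_lookup() can match the two-word entry "climate change" at the token level, and that tokens_subset() takes the same kind of logical condition as corpus_subset(); please correct me if that is wrong:

library(quanteda)

# tokenise the tweets, so that "climate change" is still two adjacent tokens
climate_tokens <- tokens(climate_corpus)

# apply the dictionary at the token level, where multiword entries
# can (I think) still be matched, before building the dfm
climate_tokens_dict <- tokens_lookup(climate_tokens, dictionary = climatechange_dict)

# build the dfm for group 1 from the looked-up tokens;
# the feature should now be the dictionary key "climate"
group1_dfm <- dfm(tokens_subset(climate_tokens_dict, tweets$group == "1"))
group1_climate <- group1_dfm[, "climate"]

Is this roughly what is meant, and would group1_climate then give me the per-tweet counts I can use for the climatescore step above?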