I'm trying to create a dataframe of specific keywords-in-context using the kwic() function, but I'm running into an issue when tokenizing the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
  filter(grepl("Schwester Agnes",
               speechContent,
               ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
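For reference, here is a minimal stand-in for test_cluster that reproduces the same setup without the full speeches_subset data (the id values and German sentences are made up; only the two fields used by corpus() are included):

```r
library(quanteda)

# Hypothetical stand-in for test_cluster (invented data)
test_cluster <- data.frame(
  id = c(1, 2),
  speechContent = c(
    "Heute sprechen wir ueber Schwester Agnes.",
    "Schwester Agnes war damals sehr bekannt."
  )
)

# Same corpus() call as in the question
test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")
```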
When I then run the following code, I would expect, first, the speechContent variable to be tokenized and, due to tokens_compound(), the compound "Schwester Agnes" to be kept together as a single token. In a second step, I would expect kwic() to return a dataframe of six rows, with the keyword variable containing the compound "Schwester Agnes". Instead, however, kwic() returns an empty dataframe with 0 observations of 7 variables. I suspect I'm making some mistake with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!
test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)
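One diagnostic I tried is inspecting the tokens directly to see how the compound is actually stored. As far as I understand, tokens_compound() joins the matched tokens with "_" by default (its concatenator argument), so the token would appear as "Schwester_Agnes" rather than as two tokens with a space, in which case the pattern above might simply not match it. A sketch of that check:

```r
# Look at the tokens of the first document to see whether
# "Schwester" and "Agnes" were joined into one token
as.list(test_tokens)[[1]]

# If the compound is stored as "Schwester_Agnes" (the default
# concatenator is "_"), the kwic() pattern would have to match
# that single token instead of the two-word phrase:
test_kwic <- kwic(test_tokens,
                  pattern = "Schwester_Agnes",
                  window = 5)
```

Is that the right way to think about it, or is there a cleaner way to make kwic() and tokens_compound() work together?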