
I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately I'm running into an issue when tokenizing the underlying dataset.

This is the subset of the dataset I'm using as a reproducible example:

library(dplyr)    # for %>% and filter()
library(quanteda) # for corpus(), tokens(), tokens_compound(), kwic()

test_cluster <- speeches_subset %>%
  filter(grepl("Schwester Agnes",
               speechContent,
               ignore.case = TRUE))

test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")

Here, test_cluster contains six observations of 12 variables, i.e. six rows in which the column speechContent contains the compound word "Schwester Agnes". corpus() then turns this data into a quanteda corpus object, test_corpus.

When I then run the following code, I expect two things: first, that the content of the speechContent variable is tokenized and that, thanks to tokens_compound(), the compound word "Schwester Agnes" is kept together as a single token; second, that kwic() returns a dataframe of six rows whose keyword column contains the compound word "Schwester Agnes". Instead, kwic() returns an empty dataframe with 0 observations of 7 variables. I suspect I'm making some mistake with tokens_compound(), but I can't see what it is. Any help would be greatly appreciated!

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)