In quanteda, is there a way to select a sentence on the condition that two words co-occur in it? I found a way to tokenize the text corpus into sentences. Playing with kwic and tokens_select suggests that they implement a logical OR for the two terms, not an AND.

I can do it with stringr, but I wanted to be sure I was not missing something.

Example with stringr:

library(tidyverse)

myStr <- c("soil carbon is the best", 
           "biodiversity is key", 
           "soil carbon is biodiversity by nature")

keyw <- c("soil","biodiversity")

# TRUE only where every keyword is detected in the sentence
tibble(sentences = myStr,
       hit_soil_carbon_biodiversity = map_lgl(myStr, ~ all(str_detect(.x, keyw))))

Thank you for any inputs!

1 Answer

Yes - you can isolate the phrase (sequence) using kwic() and then rebuild a new corpus from only the matching sentences. Setting the kwic window to 1000 keeps up to 1000 tokens on each side of the match, so even very long sentences (up to 2000 + 2 tokens for a two-word phrase) are captured in full.

library("quanteda")

# reformat the corpus as sentences
sentcorp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
tail(texts(sentcorp))
#                                           2017-Trump.83 
#          "Together, we will make America strong again." 
#                                           2017-Trump.84 
#                   "We will make America wealthy again." 
#                                           2017-Trump.85 
#                     "We will make America proud again." 
#                                           2017-Trump.86 
#                      "We will make America safe again." 
#                                           2017-Trump.87 
# "And, yes, together, we will make America great again." 
#                                           2017-Trump.88 
#      "Thank you, God bless you, and God bless America." 

# illustrate the selection
kwic(sentcorp, phrase("nuclear w*"), window = 3)
# [1977-Carter.47, 18:19]  elimination of all | nuclear weapons | from this Earth
# [1985-Reagan.88, 12:13] further increase of | nuclear weapons | .              
#  [1985-Reagan.90, 9:10]          one day of | nuclear weapons | from the face  
# [1985-Reagan.91, 27:28]          the use of | nuclear weapons | , the other    
#   [1985-Reagan.96, 4:5]     It would render | nuclear weapons | obsolete.  

# now pipe the longer kwic results back into a corpus
# (texts() is deprecated as of quanteda 3; as.character() replaces it there)
newsentcorp <- 
    kwic(sentcorp, phrase("nuclear w*"), window = 1000) %>%
    corpus(split_context = FALSE) %>%
    texts()
newsentcorp[-4]  # because 4 is really long    
#                                                                                                   1977-Carter.47.L18 
# "And we will move this year a step toward ultimate goal - - the elimination of all nuclear weapons from this Earth." 
#                                                                                                   1985-Reagan.88.L12 
#                                        "We are not just discussing limits on a further increase of nuclear weapons." 
#                                                                                                    1985-Reagan.90.L9 
#                               "We seek the total elimination one day of nuclear weapons from the face of the Earth." 
#                                                                                                    1985-Reagan.96.L4 
#                                                                          "It would render nuclear weapons obsolete."
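
The kwic() approach above selects sentences containing a phrase, that is, adjacent tokens. If the two words only need to appear somewhere in the same sentence (the logical AND from the question), one possible sketch is to tokenize and test each sentence's tokens against all keywords. This is an illustration, not a built-in quanteda selector: it reuses myStr and keyw from the question, has_all is a name made up here, and it assumes the installed quanteda provides tokens() and as.list() for tokens objects.

```r
library("quanteda")

myStr <- c("soil carbon is the best",
           "biodiversity is key",
           "soil carbon is biodiversity by nature")
keyw <- c("soil", "biodiversity")

# tokenize each sentence; as.list() yields one character vector per sentence
toks <- tokens(myStr)

# has_all (hypothetical name): TRUE only when every keyword occurs
# among a sentence's lowercased tokens
has_all <- vapply(as.list(toks),
                  function(x) all(keyw %in% tolower(x)),
                  logical(1))

myStr[has_all]
# [1] "soil carbon is biodiversity by nature"
```

Unlike str_detect(), this matches whole tokens only, so "soils" would not count as a hit for "soil". For a real corpus, the same logical vector could be computed on the tokens of a sentence-reshaped corpus and used to subset it.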