How to use quanteda to find instances of appearance of certain words before certain others in a sentence

Question

I am new to R and am using quanteda to find instances where a certain word appears somewhere before another certain word in the same sentence. More specifically, I am looking for instances where the word "investors" is located somewhere before the word "shall" in a sentence, in a corpus consisting of an international treaty concluded between Morocco and Nigeria (the text can be found here: https://edit.wti.org/app.php/document/show/bde2bcf4-e20b-4d05-a3f1-5b9eb86d3b3b).

The problem is that sometimes there are multiple words between these two words. For instance, sometimes it is written as "investors and investments shall". I tried to apply similar solutions offered on this website. When I tried the solution on (Keyword in context (kwic) for skipgrams?) and ran the following code:

 kwic(corpus_mar_nga, phrase("investors * shall"))

I get 0 observations, since this only counts instances where there is exactly one word between "investors" and "shall".

And when I followed another solution offered on (Is it possible to use `kwic` function to find words near to each other?) and ran the following code:

toks <- tokens(corpus_mar_nga)
toks_investors <- tokens_select(toks, "investors", window = 10)
kwic(toks_investors, "shall")

I also get instances where "investors" appears after "shall", and this changes the context fundamentally, since in that case the subject of the sentence is something different.

In the end, in addition to instances of "investors shall", I should also be getting, for example, instances that read "Investors, their investment and host state authorities shall", but I can't do that with the code above.

Could anyone offer me a solution on this issue?

Huge thanks in advance!

The window argument takes a vector of two values like c(10, 0) for before and after the matches. I hope this solves your problem. — Kohei Watanabe
Thanks a lot for your answer Prof. @KoheiWatanabe. It partially solves the problem. But for reasons I don't understand, I fail to see all the exact matches after applying kwic, even if I change the window argument as you suggested. For instance, in addition to other instances where "investors" appears before "shall", there should be 5 instances of the exact phrase "Investors and investments shall" in the document, but they don't appear in the kwic data frame. Do you have any idea why? — niemand

Ken Benoit · Accepted Answer · 2020-11-21T09:37:08

Good question. Here are two methods: the first relies on regular expressions applied to the corpus text, and the second (as @Kohei_Watanabe suggests in the comment) uses the window argument of tokens_select().

First, create some sample text.

library("quanteda")
## Package version: 2.1.2

# sample text
txt <- c("The investors and their supporters shall do something.
          Shall we tell the investors?  Investors shall invest.
          Shall someone else do something?")

Now reshape this into sentences, since your search operates within sentences.

# reshape to sentences
corp <- txt %>%
  corpus() %>%
  corpus_reshape(to = "sentences")

Method 1 uses regular expressions. We add a word boundary (\\b) before "investors", and the .+ matches one or more of any character between "investors" and "shall". (The . does not match newlines, but corpus_reshape(x, to = "sentences") will remove them.) Note that the pattern has no word boundary around "shall", so it would also match a longer word that contains "shall".

# method 1: regular expressions
corp$flag <- stringi::stri_detect_regex(corp, "\\binvestors.+shall",
  case_insensitive = TRUE
)
print(corpus_subset(corp, flag == TRUE), -1, -1)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "The investors and their supporters shall do something."
## 
## text1.2 :
## "Investors shall invest."

A second method applies tokens_select() with an asymmetric window, followed by kwic(). First, in every document (each of which is a sentence) that contains "investors", we keep "investors" and up to 1000 tokens after it, discarding all tokens before it; 1000 tokens should be more than enough for any sentence. Then we apply kwic() to "shall", keeping all context words. Any "shall" that is matched must come after "investors", since the selected tokens start at "investors".

# method 2: tokens_select()
toks <- tokens(corp)
tokens_select(toks, "investors", window = c(0, 1000)) %>%
  kwic("shall", window = 1000)
##                                                                     
##  [text1.1, 5] investors and their supporters | shall | do something.
##  [text1.3, 2]                      Investors | shall | invest.
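
The logic behind the token-based method can also be illustrated without quanteda, by comparing token positions directly. This is a rough base-R sketch under stated assumptions: precedes() is a hypothetical helper (not a quanteda function), and the tokenizer is a crude split on non-letters rather than quanteda's tokens().

```r
# Hypothetical helper: TRUE when some `first` token occurs earlier in the
# sentence than some `second` token
precedes <- function(sentence, first = "investors", second = "shall") {
  # Crude tokenizer (assumption): lower-case, split on anything but a-z
  toks <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  toks <- toks[nzchar(toks)]
  pos_first <- which(toks == first)
  pos_second <- which(toks == second)
  # Earliest `first` must come before the latest `second`
  length(pos_first) > 0 && length(pos_second) > 0 &&
    min(pos_first) < max(pos_second)
}

# The four sentences from the sample text above
sents <- c(
  "The investors and their supporters shall do something.",
  "Shall we tell the investors?",
  "Investors shall invest.",
  "Shall someone else do something?"
)

flags <- vapply(sents, precedes, logical(1), USE.NAMES = FALSE)
print(sents[flags])
```

This flags the same two sentences as the methods above, but it returns only a logical flag per sentence and does not give the keyword-in-context display that kwic() provides.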

The choice depends on what suits your needs best.
