Here you need a method for detecting collocations, which fortunately quanteda provides in the form of textstat_collocations(). Once you have detected them, you can compound your tokens so that each phrase becomes a single "token", and then get their frequencies in the standard way.
You do not need to know the length in advance, but you do need to specify a range of lengths. Below, I've added some more text and used a size range of 2 to 3. This also picks up "criminal background checks", without confusing the "background" in that phrase with the one in "background work". (By default, detection is case-insensitive.)
library("quanteda")
## Package version: 2.1.0
text <- c(
  "Introduction Here you see something Related work another info here",
  "Introduction another text Background work something to now",
  "Background work is related to related work",
  "criminal background checks are useful",
  "The law requires criminal background checks"
)
colls <- textstat_collocations(text, size = 2:3)
colls
## collocation count count_nested length lambda z
## 1 criminal background 2 2 2 4.553877 2.5856967
## 2 background checks 2 2 2 4.007333 2.3794386
## 3 related work 2 2 2 2.871680 2.3412833
## 4 background work 2 2 2 2.322388 2.0862256
## 5 criminal background checks 2 0 3 -1.142097 -0.3426584
Here we can see that the phrases are detected and distinguished from one another. Now we can use tokens_compound() to join them:
toks <- tokens(text) %>%
  tokens_compound(colls, concatenator = " ")

dfm(toks) %>%
  dfm_trim(min_termfreq = 2) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency()
## feature frequency rank docfreq group
## 1 introduction 2 1 2 all
## 2 something 2 1 2 all
## 3 another 2 1 2 all
## 4 related work 2 1 2 all
## 5 background work 2 1 2 all
## 6 criminal background checks 2 1 2 all
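Note: the output above was produced with quanteda 2.1.0. From quanteda 3.0 onward, the textstat_*() functions (including textstat_collocations() and textstat_frequency()) have been moved into the quanteda.textstats package, so on a newer installation you would also need to load that package before running the code above (a minimal sketch, assuming quanteda.textstats is installed):

library("quanteda")
library("quanteda.textstats")  # provides textstat_collocations() and textstat_frequency() in quanteda >= 3.0

On a larger corpus you may also want to prune weak candidates before compounding, for instance by keeping only collocations above some z threshold, e.g. tokens_compound(toks, colls[colls$z > 3, ]); on this toy example that threshold would drop the trigram, so it only becomes worthwhile once you have more data.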