29
votes

In R, grep usually matches a vector of multiple strings against one regexp.

Q: Is there a way to match a single string against multiple regexps (without looping through each regexp pattern)?

Some background:

I have 7000+ keywords as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories these keywords belong to):

ab  10  37  41
abbrach*    38
abbreche    39
abbrich*    39
abend*  37
abendessen* 60  63
aber    20  23  45
abermals    37

Concatenating so many keywords with "|" is not feasible (and I wouldn't know which of the keywords generated the hit). Simply swapping the roles of "patterns" and "strings" does not work either, because the patterns contain truncations (a trailing *), which would not be interpreted if the keywords were treated as the strings to search in.

[related question, other programming language]

3
I like dan's suggestions, but with a large data set you may run into significant speed issues. If you want to look something up in a dictionary and return a corresponding value, I would suggest a different approach: break the sentences up into vectors of individual words with strsplit and then use a hash table for fast lookup. You may also want to split the keywords and the category indicators into two separate columns in the dictionary. I'd provide assistance there, but only after you're clearer about what you want as a final outcome. – Tyler Rinker
Agreed on restructuring the dictionary data and using a hash table for lookup (depending on the desired outcome), but the match should be relatively fast depending on the number of strings, even with a large number of keywords. I'll add a quick benchmark to my answer. – danpelota
If you really have a lot of words (typically, all the words in a human language, all the words indexed by Google, etc.), you can use a prefix tree (sometimes also called a "trie"). But I am not aware of any implementation in R. – Vincent Zoonekynd
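The strsplit-plus-hash-table idea from the comments could be sketched roughly as follows. This is only an illustration under assumptions: the dictionary is taken to be a named list built from the excerpt in the question, a trailing `*` is taken to mean "prefix match", and the names `exact_dict`, `prefix_dict` and `lookup_word` are made up for this example.

```r
# Hypothetical mini-dictionary built from the excerpt in the question.
dict <- list(
  "ab"       = c(10, 37, 41),
  "abbrach*" = 38,
  "abend*"   = 37,
  "aber"     = c(20, 23, 45)
)

# Exact keywords go into a hashed environment; truncated ones are kept as prefixes.
is_prefix   <- endsWith(names(dict), "*")
exact_dict  <- list2env(dict[!is_prefix], hash = TRUE)
prefix_dict <- dict[is_prefix]
names(prefix_dict) <- sub("\\*$", "", names(prefix_dict))

# Look up one word: exact hit via the hash, truncated hits via startsWith().
lookup_word <- function(word) {
  hits <- list()
  if (exists(word, envir = exact_dict, inherits = FALSE)) {
    hits[[word]] <- get(word, envir = exact_dict)
  }
  for (p in names(prefix_dict)[startsWith(word, names(prefix_dict))]) {
    hits[[paste0(p, "*")]] <- prefix_dict[[p]]
  }
  hits
}

# Split a sentence into lower-case words and look each one up.
words  <- strsplit(tolower("Aber am Abend"), "[^[:alpha:]]+")[[1]]
words  <- words[nzchar(words)]
result <- lapply(setNames(words, words), lookup_word)
```

Each element of `result` is a list keyed by the dictionary entry that produced the hit, so the originating keyword is preserved. The prefix step is still a linear scan over the truncated keywords; a trie would be the natural replacement if that becomes the bottleneck.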

3 Answers

34
votes

What about applying the regexpr function over a vector of keywords?

keywords <- c("dog", "cat", "bird")

strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

sapply(keywords, regexpr, strings, ignore.case=TRUE)

     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1

sapply(keywords, regexpr, strings[1], ignore.case=TRUE)

 dog  cat bird 
  15   -1   -1 

Values returned are the position of the first character in the match, with -1 meaning no match.
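When the matched text itself is of interest (for example, to see what a truncated pattern actually caught), the positions can be passed on to `regmatches()`. A minimal sketch, using a made-up example string and the illustrative name `found`:

```r
keywords <- c("dog", "cat", "bird")
string   <- "Do you have a DOG?"

# regexpr() returns the start position plus a "match.length" attribute;
# regmatches() uses both to extract the matched text (empty if no match).
found <- lapply(setNames(keywords, keywords), function(k) {
  regmatches(string, regexpr(k, string, ignore.case = TRUE))
})
```

Here `found$dog` holds the text as it appears in the string, while `found$cat` and `found$bird` are empty character vectors.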

If the position of the match is irrelevant, use grepl instead:

sapply(keywords, grepl, strings, ignore.case=TRUE)

       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE
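For the dictionary in the question, the truncated keywords would first have to be turned into regexps. The following is a sketch under assumptions: a trailing `*` is the only wildcard, keywords should match whole words, and keywords contain no other regex metacharacters. `to_regex` is an illustrative helper, not part of any package.

```r
keywords <- c("ab", "abbrach*", "abend*", "aber")

# Illustrative helper: "abend*" becomes "\\babend\\w*\\b", "aber" becomes "\\baber\\b".
to_regex <- function(k) {
  paste0("\\b", sub("\\*$", "\\\\w*", k), "\\b")
}

# Name each pattern after its keyword so hits can be traced back.
patterns <- setNames(to_regex(keywords), keywords)

string <- "Aber zum Abendessen gab es nichts."
hits   <- sapply(patterns, grepl, string, ignore.case = TRUE, perl = TRUE)
names(which(hits))
# [1] "abend*" "aber"
```

Because the patterns vector is named after the original keywords, `names(which(hits))` reports which dictionary entries generated the hit for a single string.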

Update: This runs relatively quickly on my system, even with a large number of keywords:

# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936

system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))

   user  system elapsed 
  7.495   0.155   7.596 

dim(matches)
[1]      3 234936
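If the keywords are plain literals rather than regexps, one variant worth trying is fixed matching, which skips regex compilation. This is only a sketch with no timing claimed; whether it is faster depends on the data. `vapply()` is used here so the result shape is checked explicitly.

```r
keywords <- c("dog", "cat", "bird")
strings  <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

# fixed = TRUE treats each keyword as a literal substring. ignore.case has
# no effect with fixed = TRUE, so lower-case the strings once instead.
lower   <- tolower(strings)
matches <- vapply(keywords, grepl, logical(length(lower)),
                  x = lower, fixed = TRUE)
```

The result is the same strings-by-keywords logical matrix as above. Note that fixed matching finds substrings, so it cannot express the truncation wildcards from the question.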

2
votes

To expand on the other answer: to reduce the sapply() output to a useful logical vector, you need an additional apply() step.

keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
#        dog   cat  bird
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE

To know which strings contain any of the keywords (patterns):

apply(matches, 1, any)
# [1]  TRUE  TRUE FALSE

To know which keywords (patterns) were matched in the supplied strings:

apply(matches, 2, any)
#  dog  cat bird 
# TRUE TRUE TRUE
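To see which keywords matched in each individual string (rather than only whether any did), the matrix can be indexed row by row. A small sketch; `hits_per_string` is just an illustrative name:

```r
keywords <- c("dog", "cat", "bird")
strings  <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")
matches  <- sapply(keywords, grepl, strings, ignore.case = TRUE)

# One list element per string, holding the keywords that string matched.
hits_per_string <- lapply(seq_len(nrow(matches)),
                          function(i) colnames(matches)[matches[i, ]])
names(hits_per_string) <- strings

# Alternative long format: one (row, col) index pair per hit.
hit_index <- which(matches, arr.ind = TRUE)
```

`lapply()` is used instead of `apply()` so the result stays a list even when every string happens to match the same number of keywords.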

2
votes

The re2r package can match multiple patterns (in parallel). Minimal example, with keywords and strings defined as in the other answers:

# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)