Extract info inside all parentheses in R

55

votes

I have a character string and want to extract the information inside multiple pairs of parentheses. Currently I can extract the information from the last pair with the code below. How can I extract the contents of every pair and return them as a vector?

j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))

Current output is:

[1] "Laugh"

Desired output is:

[1] "wonder" "groan"  "Laugh"

regexr

66

votes

Here is an example:

> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan"  "Laugh"

I think this should work well:

> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)"  "(Laugh)"

but the result includes the parentheses... why?

This works:

regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=TRUE))[[1]]

Thanks @MartinMorgan for the comment.
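
On the "why?" above: a lookaround only asserts what sits next to the current position and does not consume it. In the first pattern, (?=\\() checks that the next character is "(", so the match starts at that parenthesis, and (?<=\\)) checks that the preceding character is ")", so the match ends just after it. Swapping the two puts both parentheses outside the match. A small base R comparison, reusing j from the question (the names with_parens and without_parens are only for illustration):

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# Lookahead for "(" at the start, lookbehind for ")" at the end:
# both parentheses fall inside the matched text.
with_parens <- regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl = TRUE))[[1]]

# Lookbehind for "(" at the start, lookahead for ")" at the end:
# both parentheses stay outside the matched text.
without_parens <- regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl = TRUE))[[1]]

print(with_parens)
print(without_parens)
```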

29

votes

Using the stringr package we can reduce this a little bit.

library(stringr)
# Get the parentheses and what is inside them
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove the parentheses (first and last character of each match)
k <- substring(k, 2, nchar(k)-1)

@kohske uses regmatches, but I'm currently on R 2.13, so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as regular expressions can be...)

Edit: We could also try something like this -

re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])

This one works by defining a marked subexpression (a capture group) inside the regular expression. str_extract_all pulls out everything that matches the regex, parentheses included, and gsub then replaces each match with just the portion captured by the subexpression.
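
For readers whose R version does include regmatches, the same capture-group idea can be sketched without stringr. This is only an illustration of the approach above, not a replacement for it; chunks and inner are made-up names.

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
re <- "\\(([^()]+)\\)"

# Pull out every "(...)" chunk, parentheses included.
chunks <- regmatches(j, gregexpr(re, j))[[1]]

# Replace each chunk with its first capture group, i.e. the text inside.
inner <- gsub(re, "\\1", chunks)
print(inner)
```

Because the group is [^()]+ (one or more characters), an empty pair such as "()" is not matched at all.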

16

votes

I think there are basically three easy ways of extracting multiple matches in R (without using substitution): str_match_all, str_extract_all, and the regmatches/gregexpr combo.

I like @kohske's regex, which looks behind for an opening parenthesis with (?<=\\(), looks ahead for a closing parenthesis with (?=\\)), and lazily grabs everything in between with .+?; put together: (?<=\\().+?(?=\\))

Using the same regex:

str_match_all returns the answer as a list containing one matrix per input string.

str_match_all(j, "(?<=\\().+?(?=\\))")
[[1]]
     [,1]    
[1,] "wonder"
[2,] "groan" 
[3,] "Laugh" 

# Subset the list, then the first column of the matrix, like this...

str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan"  "Laugh"

str_extract_all returns the answer as a list.

str_extract_all(j,  "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
str_extract_all(j,  "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan"  "Laugh"

regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note perl = TRUE, which base R needs for the lookbehind and lookahead syntax.

regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan"  "Laugh" 

#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan"  "Laugh"

Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.
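
The [[1]] in the examples above is needed because these functions are vectorised over the input and return one list element per string. A minimal base R sketch of that behaviour, assuming a helper named extract_parens (hypothetical, not part of any package) and a second made-up input string:

```r
# Hypothetical helper: returns one character vector per input string.
extract_parens <- function(x) {
  regmatches(x, gregexpr("(?<=\\().+?(?=\\))", x, perl = TRUE))
}

x <- c("What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)",
       "no parentheses here")

res <- extract_parens(x)
print(res)          # a list with one element per input string
print(unlist(res))  # flattened to a single character vector
```

Strings without any parentheses come back as an empty character vector, so unlist() simply drops them.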

6

votes

Using the rex package may make this type of task a little simpler.

library(rex)

matches <- re_matches(j,
  rex(
    "(",
    capture(name = "text", except_any_of(")")),
    ")"),
  global = TRUE)

matches[[1]]$text
#> [1] "wonder" "groan"  "Laugh"
