3
votes

This should be easy for anyone who understands regular expressions, which I'm struggling with.

I have a vector of strings that looks like

strings<-c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
"PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
"08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

I want to convert this vector into a data frame with one row per element of the original vector and 3 columns, where each column comes from one of the slash-separated parts inside the parentheses. So the result here would be

col1        col2       col3
"LN"        "WEC"      "WPS"
"LN"        "MP"       "MP"
"XF"        "CIN"      "*"

If there is more than one instance of the pattern in a string, it should take the first instance.

I think my main problem is that ( is a special character. I'm trying to escape it as \(, but I get an error saying that \( is an unrecognized escape, so I'm a little lost.


2 Answers

4
votes

Sounds like you're forgetting to escape the \ in \(, i.e. \\(:

do.call(rbind, strsplit(sub('.*?\\((.*?)\\).*', '\\1', strings), split = "/"))
     [,1] [,2]  [,3] 
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP"  "MP" 
[3,] "XF" "CIN" "*"  
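Since the question asks for a data frame rather than a matrix, here is a minimal sketch of the same approach broken into steps, with a final conversion. The intermediate names (inner, m, df) are only illustrative, and the column names are taken from the question.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

# In R source, "\\(" is the two-character string \( which the regex
# engine reads as a literal left parenthesis.
nchar("\\(")  # 2

# Step 1: keep only the text inside the first pair of parentheses.
inner <- sub(".*?\\((.*?)\\).*", "\\1", strings)

# Step 2: split on "/" and stack the pieces into a character matrix.
m <- do.call(rbind, strsplit(inner, split = "/"))

# Step 3: convert to a data frame and name the columns as in the question.
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("col1", "col2", "col3")
df
```

Note that do.call(rbind, ...) assumes every string yields the same number of fields; if some strings had a different number of slashes, rbind would recycle values with a warning rather than fail.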
2
votes

1) We define a pattern that matches

left-paren non-slashes slash non-slashes slash non-right-parens remainder

which correspond to the following respectively:

\\( ([^/]+) / ([^/]+) / ([^)]+) .*

Now extract the parenthesized portions of the pattern (the three capture groups) using strapplyc and simplify into a matrix. The code is:

library(gsubfn)
pat <- "\\(([^/]+)/([^/]+)/([^)]+).*"
strapplyc(strings, pat, simplify = rbind)

giving:

     [,1] [,2]  [,3] 
[1,] "LN" "WEC" "WPS"
[2,] "LN" "MP"  "MP" 
[3,] "XF" "CIN" "*" 
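If installing gsubfn is not an option, the same pattern can probably be used with base R only. This is a sketch of a swapped-in technique (regexec plus regmatches), not the strapplyc call itself; the names hits and fields are illustrative.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")
pat <- "\\(([^/]+)/([^/]+)/([^)]+).*"

# regexec() locates the first match in each string together with its
# capture groups; regmatches() returns them as one character vector per
# string: the full match first, then the three captured fields.
hits <- regmatches(strings, regexec(pat, strings))

# Drop the full match (element 1) and stack the captures row by row.
fields <- do.call(rbind, lapply(hits, function(h) h[-1]))
fields
```

A string with no match yields an empty vector here, so it would silently contribute no row; check lengths(hits) first if that can happen in your data.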

2) This alternative uses strapplyc nested in strapply. The regular expressions are slightly simpler and it's still basically one line of code, but that line is longer. The first regex picks out everything inside the first set of parens, and the second extracts the slash-separated fields:

strapply(strings, "\\(([^)]+).*", ~ strapplyc(x, "[^/]+")[[1]], simplify = rbind)
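To see what the two stages do separately, here is a rough base R sketch that mirrors them. It is not the gsubfn code above, only an illustration of the same two regexes applied one after the other; first_group, parts and result are made-up names.

```r
strings <- c("jklsflk fKASJLJ (LN/WEC/WPS); jsdfjDFSDKTdfkls jfdjk kdkd(LN/WEC/WPS)",
             "PEARYMP PEARYVIRGN_16 1 (LN/MP/MP)",
             "08VERMLN XF03 08VERMLN_345_3 (XF/CIN/*)")

# Stage 1: regexpr() reports only the first match per string, here a "("
# followed by everything up to (not including) the next ")".
first_group <- regmatches(strings, regexpr("\\([^)]+", strings))
first_group <- substring(first_group, 2)  # drop the leading "("

# Stage 2: every maximal run of non-slash characters is one field.
parts <- regmatches(first_group, gregexpr("[^/]+", first_group))
result <- do.call(rbind, parts)
result
```

One caveat: regmatches() with regexpr() drops strings that have no match at all, so the rows of result would no longer line up with strings in that case.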

REVISED Some improvements to first solution plus a variation as second solution.