0
votes

substr() might be a great way to extract values with conditions (in our case, the condition is extracting values from brackets), but is there any handy way to extract multiple of them and create multiple columns (new columns number is the same as the number of extracted values).

Here is one example data:

        index  abc
  1     1      qwer(urt123) qweqwe
  2     2      rte(ret390) qweqwe(tertr213) ityorty(ret435)
  3     3      <NA>
  4     4      ogi(wqe685) qwe(ieow123)
  5     5      cvb(bnm567)

code for creating the question data frame:

data.frame(index = c(1:5),
           abc = c("qwer(urt123) qweqwe", "rte(ret390) qweqwe(tertr213) ityorty(ret435)",
                    NA, "ogi(wqe685) qwe(ieow123)", "cvb(bnm567)"))

Final results:

        index  abc                                          abc1     abc2     abc3
  1     1      qwer(urt123) qweqwe                          urt123   <NA>     <NA>
  2     2      rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390   tertr213 ret435
  3     3      <NA>                                         <NA>     <NA>     <NA>
  4     4      ogi(wqe685) qwe(ieow123)                     wqe685   ieow123  <NA>
  5     5      cvb(bnm567)                                  bnm567   <NA>     <NA>

The original data set has more than 10,000 lines and the number of brackets in the abc column could be more or less than 3.

2
This post and this post might have some ideas that helpcamille

2 Answers

1
votes

Here is a base R solution

dfout <- cbind(df,
               gsub("\\(|\\)",
                    "",
                    do.call(rbind,
                            lapply(z <- with(df,regmatches(abc,gregexpr("\\(\\w+\\)",abc))),
                                   `length<-`,
                                   max(lengths(z))))))

such that

> dfout
  index                                          abc      1        2      3
1     1                          qwer(urt123) qweqwe urt123     <NA>   <NA>
2     2 rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390 tertr213 ret435
3     3                                         <NA>   <NA>     <NA>   <NA>
4     4                     ogi(wqe685) qwe(ieow123) wqe685  ieow123   <NA>
5     5                                  cvb(bnm567) bnm567     <NA>   <NA>
1
votes

Here is my attempt. I used regular expression to extract alphabets and numbers staying inside parentheses. stri_extract_all_regex() returns a list. So I used unnest_wider() to create new columns. The final step was to revise three column names. After useing unnest_wider() we get ...1 as a column name, for example. Any column names that contain ... got revised; I replaced ... with foo.

library(tidyverse)
library(stringi)

mutate(mydf,
       foo = stri_extract_all_regex(str = abc, pattern = "(?<=\\()[[:alnum:]]+(?=\\))")) %>% 
unnest_wider(foo) %>% 
rename_at(vars(contains("...")),
          .funs = list(~sub(x = ., pattern = "\\.+", replacement = "foo")))

  index abc                                          foo1   foo2     foo3  
  <int> <chr>                                        <chr>  <chr>    <chr> 
1     1 qwer(urt123) qweqwe                          urt123 NA       NA    
2     2 rte(ret390) qweqwe(tertr213) ityorty(ret435) ret390 tertr213 ret435
3     3 NA                                           NA     NA       NA    
4     4 ogi(wqe685) qwe(ieow123)                     wqe685 ieow123  NA    
5     5 cvb(bnm567)                                  bnm567 NA       NA   

DATA

mydf <- structure(list(index = 1:5, abc = c("qwer(urt123) qweqwe", "rte(ret390) 
qweqwe(tertr213) ityorty(ret435)", 
NA, "ogi(wqe685) qwe(ieow123)", "cvb(bnm567)")), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))