Using pmap with a to apply different regular expressions to different variables in a tibble?

Question

This question is very similar to Using pmap to apply different regular expressions to different variables in a tibble?, but differs because I realized my examples were not sufficient to describe my problem.

I'm trying to apply different regular expressions to different variables in a tibble. For example, I've made a tibble listing 1) the variable name I want to modify, 2) the regex I want to match, and 3) the replacement string. I'd like to apply the regex/replacement to the variable in a different data frame. Note that there may be variables in the target tibble that I don't want to modify, and the row order in my "configuration" tibble may not correspond to the column/variable order in my "target" tibble.

So my "configuration" tibble could look like this:

test_config <-  dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("","","", "")
)

I'd like to apply this to a target tibble:

test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

So the goal is to replace a different string with an empty string in user-specified column/variables of the test_target.

The result should be like this:

result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

I can do what I want with a for loop, like this:

for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
                   .vars = dplyr::vars(
                     tidyselect::matches(test_config$string_col[[i]])),
                   .funs = dplyr::funs(
                     stringr::str_replace_all(
                       ., test_config$pattern[[i]], 
                       test_config$replacement[[i]]))
  )
}

Instead, is there a more tidy way to do what I want? So far, thinking that purrr::pmap was the tool for the job, I've made a function that takes a data frame, variable name, regular expression, and replacement value and returns the data frame with a single variable modified. It behaves as expected:

testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
                         .vars = dplyr::vars(
                           tidyselect::matches(!!colName)),
                         .funs = dplyr::funs(
                           stringr::str_replace_all(., regex, repVal))
  )
}

# try with example
out <- testFun(test_target, 
               test_config$string_col[[1]], 
               test_config$pattern[[1]], 
               "")

However, when I try to use that function with pmap, I run into a couple problems: 1) is there a better way to build the list for the pmap call than this?

purrr::pmap(
    list(test_target, 
         test_config$string_col, 
         test_config$pattern, 
         test_config$replacement),
    testFun
)

2) When I call pmap, I get an error:

Error: Element 2 has length 4, not 1 or 5.

So pmap isn't happy that I'm trying to pass a tibble of length 5 as an element of a list whose other elements are of length 4 (I thought it would recycle the tibble).

Note also that previously, when I called pmap with a 4-row tibble, I got a different error,

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)

Can any of you suggest a way to use pmap to do what I want, or is there a different or better tidyverse approach to the problem?

Thanks!

Yes, pmap will accept a list of lists and the elements of those lists have to be of the same length or 1. I am not sure pmap is the right tool for what you are trying to accomplish — prosoitos
I had thought passing the tibble as a list item would behave because I thought it was a 1-length list, as in length(list(test_target)). So I expected it to be a recycled element. For now, I've got the loop to fall back on, as well. — bheavner
That's a useful brainstorm! This is close... ` purrr::pmap( list(list(test_target), test_config$string_col, test_config$pattern, test_config$replacement), testFun )` But that leads to a list of 4 tibbles. I can solve it with pmap_dfr and a %>% distinct() and I get the result I want, but at the expense of a potentially big memory hog... Nicer to avoid making the 4 copies of the output tibble in the first place... — bheavner

camille camille · Accepted Answer · 2018-10-30T20:47:35

Here are two tidyverse ways. One is similar to the data.table answer, in that it involves reshaping the data, joining it with the configs, and reshaping back to wide. The other is purrr-based and, in my opinion, a little bit of a weird approach. I'd recommend the first, since it feels more intuitive.

Use tidyr::gather to make the data long-shaped, then dplyr::left_join to make sure that every text value from test_target has a corresponding pattern & replacement—even the cases (col5) without patterns will be retained by using a left join.

library(tidyverse)
...

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col"))
#> # A tibble: 25 x 4
#>    col   text  pattern replacement
#>    <chr> <chr> <chr>   <chr>      
#>  1 col1  Foo   "^\\.$" ""         
#>  2 col1  bar   "^\\.$" ""         
#>  3 col1  .     "^\\.$" ""         
#>  4 col1  NA    "^\\.$" ""         
#>  5 col1  NULL  "^\\.$" ""         
#>  6 col2  Foo   ^NA$    ""         
#>  7 col2  bar   ^NA$    ""         
#>  8 col2  .     ^NA$    ""         
#>  9 col2  NA    ^NA$    ""         
#> 10 col2  NULL  ^NA$    ""         
#> # ... with 15 more rows

Using an ifelse replace the pattern where a pattern exists, or keep the original text if the pattern doesn't. Keep just the necessary patterns, add a row number because spread needs unique IDs, and make the data wide again.

test_target %>%
  gather(key = col, value = text) %>%
  left_join(test_config, by = c("col" = "string_col")) %>% 
  mutate(new_text = ifelse(is.na(pattern), text, str_replace(text, pattern, replacement))) %>%
  select(col, new_text) %>%
  group_by(col) %>%
  mutate(row = row_number()) %>%
  spread(key = col, value = new_text) %>%
  select(-row)
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .

The second way is to make a tiny tibble of just the column names, join that with the configs, and split into a list of lists. Then purrr::map2_dfc maps over both this list you've created and the columns of test_target, and returns a data frame by cbinding. The reason this works is that data frames are technically lists of columns, so if you map over a data frame, you're treating each column like a list item. I couldn't get a ifelse to work right here—something in the logic had only single strings coming back instead of the whole vector.

tibble(all_cols = names(test_target)) %>%
  left_join(test_config, by = c("all_cols" = "string_col")) %>%
  split(.$all_cols) %>%
  map(as.list) %>%
  map2_dfc(test_target, function(info, text) {
    if (is.na(info$pattern)) {
      text
    } else {
      str_replace(text, info$pattern, info$replacement)
    }
  })
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .

^{Created on 2018-10-30 by the reprex package (v0.2.1)}

Using pmap with a to apply different regular expressions to different variables in a tibble?

4 Answers