This question is very similar to Using pmap to apply different regular expressions to different variables in a tibble?, but differs because I realized my examples were not sufficient to describe my problem.
I'm trying to apply different regular expressions to different variables in a tibble. For example, I've made a tibble listing 1) the variable name I want to modify, 2) the regex I want to match, and 3) the replacement string. I'd like to apply the regex/replacement to the variable in a different data frame. Note that there may be variables in the target tibble that I don't want to modify, and the row order in my "configuration" tibble may not correspond to the column/variable order in my "target" tibble.
So my "configuration" tibble could look like this:
test_config <- dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)
I'd like to apply this to a target tibble:
test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
So the goal is to replace a different string with an empty string in each user-specified column of test_target, leaving the other columns untouched.
The result should be like this:
result <- dplyr::tibble(
  col1 = c("Foo", "bar", "", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", ""),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)
I can do what I want with a for loop, like this:
for (i in seq(nrow(test_config))) {
  test_target <- dplyr::mutate_at(test_target,
    .vars = dplyr::vars(
      tidyselect::matches(test_config$string_col[[i]])),
    .funs = dplyr::funs(
      stringr::str_replace_all(
        ., test_config$pattern[[i]],
        test_config$replacement[[i]]))
  )
}
Instead, is there a more tidy way to do what I want?
So far, thinking that purrr::pmap was the tool for the job, I've made a function that takes a data frame, variable name, regular expression, and replacement value and returns the data frame with a single variable modified. It behaves as expected:
testFun <- function(df, colName, regex, repVal){
  colName <- dplyr::enquo(colName)
  df <- dplyr::mutate_at(df,
    .vars = dplyr::vars(
      tidyselect::matches(!!colName)),
    .funs = dplyr::funs(
      stringr::str_replace_all(., regex, repVal))
  )
}
# try with example
out <- testFun(test_target,
  test_config$string_col[[1]],
  test_config$pattern[[1]],
  "")
However, when I try to use that function with pmap, I run into a couple of problems:
1) is there a better way to build the list for the pmap call than this?
purrr::pmap(
  list(test_target,
    test_config$string_col,
    test_config$pattern,
    test_config$replacement),
  testFun
)
2) When I call pmap, I get an error:
Error: Element 2 has length 4, not 1 or 5.
So pmap isn't happy that I'm trying to pass a tibble of length 5 as an element of a list whose other elements are of length 4 (I thought it would recycle the tibble).
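To illustrate where the length 5 comes from: a tibble is itself a list of its columns, so its length is the number of columns, whereas a tibble wrapped in list() has length 1 and can be recycled. The following is only a minimal sketch on small made-up data; the names target, cols, and patterns are hypothetical stand-ins for my real objects, and the anonymous function replaces testFun to keep the example short.

```r
# Hypothetical miniature of the problem: 3 columns, 2 "configuration" rows.
target <- dplyr::tibble(
  a = c("x", "."),
  b = c("NA", "y"),
  c = c("z", "NULL")
)
cols <- c("a", "b")
patterns <- c("^\\.$", "^NA$")

# A tibble is a list of columns, so its length is the column count.
n_cols <- length(target)
# Wrapped in list(), it is a single element that pmap can recycle.
n_wrapped <- length(list(target))

# With the wrapped tibble, pmap runs once per configuration row.
# Each call receives the ORIGINAL tibble, so the output is a list of
# separately modified copies rather than one cumulative result.
out <- purrr::pmap(
  list(list(target), cols, patterns),
  function(df, col, pattern) {
    df[[col]] <- stringr::str_replace_all(df[[col]], pattern, "")
    df
  }
)
```

So wrapping the tibble avoids the length error, but each element of out has only one column modified, which is why the copies would still need to be combined afterwards.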
Note also that previously, when I called pmap with a 4-row tibble, I got a different error,
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)
Can any of you suggest a way to use pmap to do what I want, or is there a different or better tidyverse approach to the problem?
Thanks!
pmap will accept a list of lists and the elements of those lists have to be of the same length or 1. I am not sure pmap is the right tool for what you are trying to accomplish – prosoitos
[...] length(list(test_target)). So I expected it to be a recycled element. For now, I've got the loop to fall back on, as well. – bheavner
[...] pmap_dfr and a %>% distinct() and I get the result I want, but at the expense of a potentially big memory hog... Nicer to avoid making the 4 copies of the output tibble in the first place... – bheavner
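One possible way to avoid the extra copies, offered as a sketch rather than a confirmed answer, is to fold over the configuration rows with purrr::reduce so that each step receives the tibble produced by the previous one. This assumes string_col holds exact column names, so it indexes columns with [[ ]] instead of the regex-based tidyselect::matches() used in the loop above.

```r
test_config <- dplyr::tibble(
  string_col = c("col1", "col2", "col4", "col3"),
  pattern = c("^\\.$", "^NA$", "^$", "^NULL$"),
  replacement = c("", "", "", "")
)

test_target <- dplyr::tibble(
  col1 = c("Foo", "bar", ".", "NA", "NULL"),
  col2 = c("Foo", "bar", ".", "NA", "NULL"),
  col3 = c("Foo", "bar", ".", "NA", "NULL"),
  col4 = c("NULL", "NA", "Foo", ".", "bar"),
  col5 = c("I", "am", "not", "changing", ".")
)

# Fold over the row indices of the configuration tibble. The accumulator
# `df` starts as test_target (.init) and carries every earlier replacement
# forward, so only one result tibble is built.
result <- purrr::reduce(
  seq_len(nrow(test_config)),
  function(df, i) {
    col <- test_config$string_col[[i]]  # assumed to be an exact column name
    df[[col]] <- stringr::str_replace_all(
      df[[col]],
      test_config$pattern[[i]],
      test_config$replacement[[i]]
    )
    df
  },
  .init = test_target
)
```

This is essentially the for loop expressed as a fold, so it does not need pmap_dfr or distinct() to merge partial results.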