5
votes

I have a data.frame with a large number of columns whose names follow a pattern. Such as:

df <- data.frame(
  x_1 = c(1, NA, 3), 
  x_2 = c(1, 2, 4), 
  y_1 = c(NA, 2, 1), 
  y_2 = c(5, 6, 7)
)

I would like to apply mutate_at to perform the same operation on each pair of columns. As in:

df %>%
  mutate(
    x = ifelse(is.na(x_1), x_2, x_1), 
    y = ifelse(is.na(y_1), y_2, y_1)
  )

Is there a way I can do that with mutate_at/mutate_each?

This:

df %>%
  mutate_each(vars(x_1, y_1), funs(ifelse(is.na(.), vars(x_2, y_2), .)))

and various variations I've tried all fail.

The question is similar to Using functions of multiple columns in a dplyr mutate_at call, but different in that the second argument to the function call is not a single column, but a different column for each column in vars.

Thanks in advance.

3
I'm working with a similar thing right now. It's the same issue as in my previous question: stackoverflow.com/questions/47005763/…, but in this case the dataset is so big that RStudio crashes.LightonGlass
a data.table set loop is probably one of the faster ways to do this. dplyr::coalesce might be a bit better for readability as wellzacdav

3 Answers

1
votes

Old question, but I agree with Jesse that you need to tidy your data a bit. gather would be the way to go, but it lacks somehow the possibility of stats::reshape where you can specify groups of columns to gather. So here's a solution with reshape:

df %>% 
   reshape(varying   = list(c("x_1", "y_1"), c("x_2", "y_2")), 
           times     = c("x", "y"),
           direction = "long") %>% 
   mutate(x = ifelse(is.na(x_1), x_2, x_1)) %>% 
   reshape(idvar     = "id", 
           timevar   = "time",
           direction = "wide") %>% 
   rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
#   id x_1 x_2 x y_1 y_2 y
# 1  1   1   1 1  NA   5 5
# 2  2  NA   2 2   2   6 2
# 3  3   3   4 3   1   7 1

In order to do that with any number of column pairs, you could do something like:

df2 <- setNames(cbind(df, df), c(t(outer(letters[23:26], 1:2, paste, sep = "_"))))
v <- split(names(df2), purrr::map_chr(names(df2), ~ gsub(".*_(.*)", "\\1", .)))
n <- unique(purrr::map_chr(names(df2), ~ gsub("_[0-9]+", "", .) ))
df2 %>% 
    reshape(varying   = v, 
            times     = n,
            direction = "long") %>% 
     mutate(x = ifelse(is.na(!!sym(v[[1]][1])), !!sym(v[[2]][1]), !!sym(v[[1]][1]))) %>% 
     reshape(idvar     = "id", 
             timevar   = "time",
             direction = "wide") %>% 
     rename_all(funs(gsub("[a-zA-Z]+(_*)([0-9]*)\\.([a-zA-Z]+)", "\\3\\1\\2", .)))
#   id w_1 w_2 w x_1 x_2 x y_1 y_2 y z_1 z_2 z
# 1  1   1   1 1  NA   5 5   1   1 1  NA   5 5
# 2  2  NA   2 2   2   6 2  NA   2 2   2   6 2
# 3  3   3   4 3   1   7 1   3   4 3   1   7 1

This assumes that columns which should be compared are next to each other and that all columns for with possible NA values are in columns suffixed by _1 and the replacement value columns are sufficed by _2.

3
votes

I don't know if you can get it that way, but here's a different perspective on the problem. If you find yourself with really wide data (e.g., tons of columns with similar names) and you want to do something with them, it might help to tidy the data (long in stata terms) with tidyr::gather (see docs here http://tidyr.tidyverse.org/).

> df %>% gather()
   key value
1  x_1     1
2  x_1    NA
3  x_1     3
4  x_2     1
5  x_2     2
6  x_2     4
7  y_1    NA
8  y_1     2
9  y_1     1
10 y_2     5
11 y_2     6
12 y_2     7

After converting the data to this format, it's easier to combine and rearrange values using group_by instead of trying to mutate_at things. E.g., you can ge the first values with df %>% gather() %>% mutate(var = substr(key,1,1)) and manipulate the xs and ys differently using group_by(var).

-3
votes

When I asked this question, the answer was "you can't!" That's no longer the answer, since tidyr now supports pivot_wider and pivot_longer.