I often need to mutate a data frame by adding several columns at once using a custom function, preferably with parallelization. Below are the ways I already know how to do this.
Setup
library(dplyr)
library(plyr)
library(purrr)
library(doMC)
registerDoMC(2)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
Suppose that I want two new columns, foocol = x + y and barcol = (x + y) * 100, but that these are actually complex calculations done in a custom function.
Method 1: Add columns separately using rowwise and mutate
foo <- function(x, y) return(x + y)
bar <- function(x, y) return((x + y) * 100)
df_out1 <- df %>% rowwise() %>% mutate(foocol = foo(x, y), barcol = bar(x, y))
This is not a good solution since it requires two function calls for each row and two "expensive" calculations of x + y. It's also not parallelized.
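One way to avoid the duplicated work, sketched here in base R rather than with rowwise and mutate: a single function computes the shared intermediate once per row and returns both values. The names foobar_row, rows, and df_out1b are hypothetical and only for illustration, and x + y stands in for the real expensive calculation.

```r
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical helper: computes the shared "expensive" term once per row
# and returns both derived values as a one-row data frame.
foobar_row <- function(x, y) {
  s <- x + y
  data.frame(foocol = s, barcol = s * 100)
}

# Call the helper once per row, then stack the one-row results
# and attach them to the original columns.
rows <- Map(foobar_row, df$x, df$y)
df_out1b <- cbind(df, do.call(rbind, rows))
```

This still runs serially; it only removes the second function call and the repeated x + y.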
Method 2: Trick ddply into rowwise operation
df2 <- df
df2$id <- 1:nrow(df2)
df_out2 <- ddply(df2, .(id), function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(cbind(r, foocol, barcol))
}, .parallel = TRUE)
Here I trick ddply into calling a function on each row by splitting on a unique id column I just created. It's clunky, though, and requires maintaining a useless column.
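For comparison, here is a minimal sketch of the same split-by-row idea without the extra id column, using split and parallel::mclapply from base R instead of ddply. It assumes a Unix-like system for actual forking (on Windows mc.cores has to be 1), and the names row_list, out_list, n_cores, and df_out2b are hypothetical.

```r
library(parallel)

set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split into one-row data frames by row index; no id column is added to df.
row_list <- split(df, seq_len(nrow(df)))

# Apply the row function to each one-row data frame, possibly in parallel.
out_list <- mclapply(row_list, function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  cbind(r, foocol = foocol, barcol = barcol)
}, mc.cores = n_cores)

# Stack the per-row results back into a single data frame.
df_out2b <- do.call(rbind, out_list)
```

Splitting a data frame into one-row pieces has its own overhead, so whether this beats the ddply version would need to be measured on real data.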
Method 3: splat
foobar <- function(x, y, ...) {
  foocol <- x + y
  barcol <- foocol * 100
  return(data.frame(x, y, ..., foocol, barcol))
}
df_out3 <- splat(foobar)(df)
I like this solution since you can reference the columns of df in the custom function (which can be anonymous if desired) without array comprehension. However, this method isn't parallelized.
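As far as I understand it, splat(foobar)(df) behaves like do.call(foobar, df): the function is called once with whole columns as arguments, not once per row. The sketch below uses do.call as a base-R stand-in for splat and adds a hypothetical counter, calls, to make that visible; df_out3b is also an illustrative name.

```r
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

calls <- 0  # hypothetical counter: how many times foobar runs

foobar <- function(x, y, ...) {
  calls <<- calls + 1
  foocol <- x + y
  barcol <- foocol * 100
  data.frame(x, y, ..., foocol, barcol)
}

# Base-R analogue of splat(foobar)(df): each column becomes a named argument.
df_out3b <- do.call(foobar, df)
```

Because foobar receives full column vectors here, this approach only gives row-by-row results when the calculation inside is itself vectorized.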
Method 4: by_row
df_out4 <- df %>% by_row(function(r) {
  foocol <- r$x + r$y
  barcol <- foocol * 100
  return(data.frame(foocol = foocol, barcol = barcol))
}, .collate = "cols")
The by_row function from purrr eliminates the need for the unique id column, but this operation isn't parallelized.
Method 5: pmap_df
df_out5 <- pmap_df(df, foobar)
# or equivalently...
df_out5 <- df %>% pmap_df(foobar)
This is the best option I've found. The pmap family of functions also accepts anonymous functions to apply to the arguments. I believe pmap_df converts df to a list and back, though, so maybe there is a performance hit.
It's also a bit annoying that I need to reference all the columns I plan on using for calculation in the function definition function(x, y, ...) instead of just function(r) for the row object.
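One possible workaround, sketched under assumptions: a variadic function can collect its named arguments into a list and treat that list as the row object. The example uses do.call(Map, ...) as a base-R stand-in for pmap so it runs without purrr; foobar_r, rows, and df_out5b are hypothetical names.

```r
set.seed(1)
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))

# Hypothetical row function: gathers every column value of the current row
# into a named list r, so columns are accessed as r$x, r$y, and so on.
foobar_r <- function(...) {
  r <- list(...)
  foocol <- r$x + r$y
  barcol <- foocol * 100
  data.frame(r, foocol = foocol, barcol = barcol)
}

# do.call(Map, ...) walks the columns in parallel, one call per row,
# passing each value under its column name.
rows <- do.call(Map, c(f = foobar_r, df))
df_out5b <- do.call(rbind, rows)
```

Since pmap also passes list elements as named arguments, the same variadic function should work with pmap_df(df, foobar_r), though I have not benchmarked either form. Note that a column literally named f would clash with Map's own argument in this sketch.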
Am I missing any good or better options? Are there any concerns with the methods I described?
... plyr after loading dplyr. It gives a warning that you should listen to. - Gregor Thomas
... purrr::invoke and its variants, e.g. invoke(foobar, df) (equivalent to do.call(foobar, df)). Really it's usually a good thing if you need to reference which columns get passed to what parameters, though, or you can make mistakes silently. - alistaire
... rowwise(). - Simon Jackson