2
votes

What is the preferred way to send all columns within a current group to a function as a tibble or data.frame when calling an arbitrary function in a dplyr pipe?

In the example below, mean_B is a simple case where I know what is needed before I make the function call. mean_B_fun gives the wrong answer (I want the within-group mean, but it returns the mean over the whole data frame), and mean_B_fun_ugly gives what I want, but it seems like an inefficient and ugly way to get that result.

The reason I want to operate on arbitrary columns is that in practice, I'm taking my_fun in the example below from the user, and I don't know the columns that the user will need to operate on a priori.

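library(dplyr)

# Stand-in for a user-supplied function that expects a data frame
my_fun <- function(x) mean(x$B)

my_data <-
  expand.grid(A=1:3, B=1:2) %>%
  mutate(B=A*B) %>%
  group_by(A) %>%
  mutate(mean_B=mean(B),        # desired result: within-group mean
         mean_B_fun=my_fun(.),  # wrong: `.` is the whole data frame, not the current group
         mean_B_fun_ugly=my_fun(as.data.frame(.)[.$A == unique(A),,drop=FALSE]))  # correct, but clumsy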
mutate_all will apply a function, by group, to all columns other than the grouping columns. For my_fun, the argument x should be a vector and the operation in the function would be mean(x), since mutate will pass a vector of values from a given column. - eipi10
There are basically two types of functions in the tidyverse: 1) those that take a data frame as the first argument (used in pipes, e.g. tidyr::separate or dplyr::top_n) and 2) those that take vectors (e.g. all functions in stringr or many base functions, such as mean, max, sum); these are typically used in mutate statements. There are some that can take either a data frame or a vector (like purrr::map), but the behaviour will be different. Your user function should be type 2: it should take a vector, not a data frame. Assuming the user does not subset inside the function, group_by will be honored. - dmi3kno
@eipi10 The user may need to apply a function to multiple columns to get the output of their function. In general, what is written as mean(x$B) could alternatively be mean(x$B) + mean(x$A), and I wouldn't know which columns they need. - Bill Denney
@dmi3kno, "Your user-function should... take a vector, not a dataframe." I can't impose that restriction on my users. In one function call, they may need column A, in the next they may need column B, and in the one after that they may need both A and B. More generally, I don't know all the column names in the user's dataset, what they will mean to the user, and which will be important. - Bill Denney

1 Answer

0
votes

Here's my answer, which does not assume you know the columns on which you want to calculate the mean.

library(dplyr)
library(tidyr)  # nest()
library(purrr)  # map()
expand.grid(A=1:3, B=1:2) %>%
  mutate(B=A*B) %>% nest(-A) %>%
  mutate(means = map(.$data, function(x) colMeans(x)))

  A data means
1 1 1, 2   1.5
2 2 2, 4     3
3 3 3, 6   4.5
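
The same nest-and-map idea can probably be extended to call the question's my_fun on each group's data frame instead of colMeans. The following is only a sketch under some assumptions: my_fun returns a single number per group (hence map_dbl), and it does not need the grouping column, because the nested data frames contain every column except A. The name result is just an illustrative placeholder.

```r
library(dplyr)
library(tidyr)
library(purrr)

# User-supplied function from the question: takes a data frame, returns one number
my_fun <- function(x) mean(x$B)

result <-
  expand.grid(A = 1:3, B = 1:2) %>%
  mutate(B = A * B) %>%
  group_by(A) %>%
  nest() %>%                                      # one row per group; `data` holds that group's rows
  mutate(mean_B_fun = map_dbl(data, my_fun)) %>%  # call my_fun once per group data frame
  unnest(data) %>%                                # back to one row per original observation
  ungroup()

# mean_B_fun is 1.5, 3 and 4.5 for A = 1, 2 and 3
```

If the user's function might return something other than a single number, map (which returns a list column) would be the safer choice than map_dbl.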