10
votes

I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.

As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:

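library(dplyr)  # provides %>%, filter(), group_by() and summarize()
data(iris)

# Fit Petal.Width ~ Petal.Length and return the slope estimate
lm.func = function(.data){
  lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
  out = summary(lm.fit)$coefficients[2,1]  # row 2 = Petal.Length, column 1 = Estimate
  return(out)
}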

Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:

iris %>% filter(Species == "setosa") %>% lm.func(.)

But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:

iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))

I know this doesn't work, but the idea is to pass each subset of iris to lm.func.

To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.

Is it possible to use summarize() in this way?

2
This solved it -- thanks akrun! - Mark T Patterson

2 Answers

15
votes

You can use do(), which evaluates its expression once per group, with . bound to the rows of the current group:

iris %>%
  group_by(Species) %>%
  do(data.frame(coef.val = lm.func(.)))
#     Species  coef.val
#1     setosa 0.2012451
#2 versicolor 0.3310536
#3  virginica 0.1602970
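
For context on why the summarize() attempt in the question fails: as far as I can tell, the pipe binds . to the whole grouped data frame, so lm.func(.) never sees a single group. Also, do() has been marked as superseded since dplyr 1.0.0. A minimal sketch of the same per-group call using group_modify(), assuming dplyr 0.8.1 or later is installed (by_species is just an illustrative name, not something from the question):

```r
library(dplyr)

# Same helper as in the question: slope of Petal.Width on Petal.Length.
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# group_modify() calls the lambda once per group. .x holds that group's
# rows (without the grouping column), and the lambda must return a data frame.
by_species <- iris %>%
  group_by(Species) %>%
  group_modify(~ data.frame(coef.val = lm.func(.x))) %>%
  ungroup()

print(by_species)
```

The result should have one row per Species and a coef.val column, the same shape as the do() output above.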
3
votes

There is an easy way to do this without writing a separate function: fit one model per group, then tidy the results with broom.

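library(dplyr)
library(broom)

# Fit one model per species and store each fit in the list-column mod
models <- iris %>%
  group_by(Species) %>%
  do(
    mod = lm(Petal.Width ~ Petal.Length, data = .)
  )

# Expand each stored model into a data frame of coefficient statistics
models %>% do(tidy(.$mod))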

          term    estimate  std.error  statistic      p.value
1  (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
2 Petal.Length  0.20124509 0.08263253  2.4354220 1.863892e-02
3  (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
4 Petal.Length  0.33105360 0.03750041  8.8279995 1.271916e-11
5  (Intercept)  1.13603130 0.37936622  2.9945505 4.336312e-03
6 Petal.Length  0.16029696 0.06800119  2.3572668 2.253577e-02
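
Note that the output above no longer shows which species each pair of rows belongs to. One possible way to keep the grouping column is to call tidy() inside group_modify(). This is only a sketch, assuming dplyr 0.8.1 or later and broom are installed (tidy_by_species is an illustrative name):

```r
library(dplyr)
library(broom)

# Fit the model on each species' rows (.x) and return the tidy coefficient
# table; group_modify() adds the Species column back to every row.
tidy_by_species <- iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(Petal.Width ~ Petal.Length, data = .x))) %>%
  ungroup()

print(tidy_by_species)
```

The slope for each species can then be pulled out with filter(term == "Petal.Length").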