plyr::ddply equivalent in dplyr

Question

I personally learned plyr prior to dplyr, and I'm trying to normalize my code into the dplyr syntax wherever possible, but I get stuck with the following use-case:

ddply(
    .data = somedataframe, 
    .variables = c('var1', 'var2'),
    .fun = 
        function(thisdf){
            ...
        }
)

Where the ... inside the function is some arbitrarily complex modification of the data frame. Note that the choice of ddply versus dlply (or any other dxply) is purely for illustration. Does a function exist within dplyr (call it dplyr::f for the moment) that can also take an arbitrary modification function? For example:

somedataframe %>% 
    group_by(var1, var2) %>% 
    dplyr::f(.function = function(thisdf){ ... })

In my investigation of this functionality, all the examples that I could find were extremely simple summarise implementations of ddply.

I think the just released update to dplyr has some additional grouping verbs such as group_by_map that attempt to cover this kind of thing. — joran
Sorry, it's group_map and you can read about it here. The older way of doing this sort of thing in dplyr (that still works, I believe) is do(). — joran
you're not very far from a reproducible example, just give us a data set and a real function, with the expected output, and you'll get a great answer in no time and will help more people with the same issue. — Moody_Mudskipper
@Moody_Mudskipper - joran has sufficiently answered the question in his comment — jameselmore

CoderGuy123 · Accepted Answer · 2019-05-07T08:19:51

Probably the simplest way is to use dplyr::do(), but group_modify() also works. A complete example:

library(tidyverse)

#some complex function
func = function(x) {
  mod = lm(Sepal.Length ~ Petal.Width, data = x)
  mod_coefs = broom::tidy(mod)

  tibble(
    mean_sepal_length = mean(x$Sepal.Length),
    mean_petal_width = mean(x$Petal.Width), 
    slope = mod_coefs[[2, 2]],
    slope_p = mod_coefs[[2, 5]]
  )
}

#plyr version
plyr::ddply(iris, "Species", func)

#dplyr with do()
iris %>% 
  group_by(Species) %>% 
  do(func(.))

#dplyr with group_modify()
#the function must accept a second argument, the group key (a one-row tibble holding the grouping values)
func2 = function(x, y) {
  mod = lm(Sepal.Length ~ Petal.Width, data = x)
  mod_coefs = broom::tidy(mod)

  tibble(
    mean_sepal_length = mean(x$Sepal.Length),
    mean_petal_width = mean(x$Petal.Width), 
    slope = mod_coefs[[2, 2]],
    slope_p = mod_coefs[[2, 5]]
  )
}

iris %>% 
  group_by(Species) %>% 
  group_modify(func2)

These produce:

     Species mean_sepal_length mean_petal_width     slope      slope_p
1     setosa             5.006            0.246 0.9301727 5.052644e-02
2 versicolor             5.936            1.326 1.4263647 4.035422e-05
3  virginica             6.588            2.026 0.6508306 4.798149e-02

# A tibble: 3 x 5
# Groups:   Species [3]
  Species    mean_sepal_length mean_petal_width slope   slope_p
  <fct>                  <dbl>            <dbl> <dbl>     <dbl>
1 setosa                  5.01            0.246 0.930 0.0505   
2 versicolor              5.94            1.33  1.43  0.0000404
3 virginica               6.59            2.03  0.651 0.0480   

# A tibble: 3 x 5
# Groups:   Species [3]
  Species    mean_sepal_length mean_petal_width slope   slope_p
  <fct>                  <dbl>            <dbl> <dbl>     <dbl>
1 setosa                  5.01            0.246 0.930 0.0505   
2 versicolor              5.94            1.33  1.43  0.0000404
3 virginica               6.59            2.03  0.651 0.0480

There are two differences. The ddply() output is a plain data frame, even though the function returns a tibble. The dplyr outputs are grouped tibbles, even though the grouping has already been 'used'.
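If you want the dplyr result to match what ddply() returns more closely, one option is to drop the grouping with ungroup() and convert with as.data.frame(). The sketch below also uses the formula shorthand (~ f(.x)) so the one-argument function does not need rewriting, and shows group_map(), which returns a list and so is closer in spirit to dlply(). The helper name slope_summary and its simplified body (base R summary() instead of broom) are my own choices for illustration, not part of the answer above.

```r
library(dplyr)

# Hypothetical one-argument per-group function (simplified from the answer's func)
slope_summary <- function(x) {
  mod <- lm(Sepal.Length ~ Petal.Width, data = x)
  coefs <- summary(mod)$coefficients
  tibble(
    mean_sepal_length = mean(x$Sepal.Length),
    slope = coefs[2, 1]  # row 2 = Petal.Width, column 1 = estimate
  )
}

# group_modify() with a formula: .x is the group's rows, .y (unused here) is the key
res <- iris %>%
  group_by(Species) %>%
  group_modify(~ slope_summary(.x)) %>%
  ungroup()  # drop the leftover grouping

# Plain data frame, as ddply() would return
res_df <- as.data.frame(res)

# group_map() returns a list with one element per group
res_list <- iris %>%
  group_by(Species) %>%
  group_map(~ slope_summary(.x))
```

Here res is an ungrouped tibble with one row per species, res_df is the same data as a plain data frame, and res_list is an unnamed list of three one-row tibbles.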
