3
votes

I'm looking to replace some of my R code that uses dplyr::do because this function will soon be deprecated. A lot of my work requires creating stratified CDF plots. When using dplyr::do, the variable I stratify on is passed to the resulting data frame as a column, which I can then easily use for plotting.

I have a solution to replace dplyr::do using dplyr::group_split and purrr::map_df. However, the variable I pass to dplyr::group_split does not appear in the resulting data frame. This makes plotting stratified data difficult. How do I ensure that the variable I pass to dplyr::group_split is included as a named column in the resulting data frame?

Here is some code creating the data I need to plot with dplyr::do:

library(dplyr)
library(purrr)
library(ggplot2)
library(Hmisc)  # provides wtd.Ecdf()

# simulate data
dat <- tibble(
  strat = rep(letters[1:3], each = 33), 
  var   = rnorm(99, 0, 1))

# example 1 that works, but will be deprecated
test_dat_1 <- dat %>% 
  dplyr::select(strat, var) %>%
  dplyr::group_by(strat) %>%
  dplyr::do(data.frame(X = wtd.Ecdf(.[[2]])$x, 
                       Y = wtd.Ecdf(.[[2]])$ecdf*100))

# this is the target plot
p <- ggplot(test_dat_1, aes(X, Y, colour = strat))
p + geom_step()

Here is my attempt to create the same data with dplyr::group_split and purrr::map_df. Its limitation is that the variable I stratify on does not appear in the final data frame, which makes plotting the stratified data cumbersome:

# replacement for 'do'
test_dat_2 <- dat %>%
  group_split(strat) %>%
  map_df(~wtd.Ecdf(.x$var),
         tibble::enframe(name = "X", value = "Y"))
2
I've struggled with this sort of thing, too. One option is to switch to something like group_nest(). Alternatively, since split() names the output list, you can use it instead of group_split(). A simple example: dat %>% split(.$strat) %>% map_df(~data.frame(X = mean(.x$var)), .id = "strat") – aosmith
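The split() idea from the comment can be applied to the CDF data itself. The following is only a minimal sketch, assuming wtd.Ecdf comes from Hmisc; the name test_dat_split and the seed are made up for illustration.

```r
library(dplyr)
library(purrr)

# simulate data as in the question (the seed is an arbitrary choice)
set.seed(1)
dat <- tibble(
  strat = rep(letters[1:3], each = 33),
  var   = rnorm(99, 0, 1))

# split() names each list element after its strat value, so map_df()
# can copy those names into a "strat" column through the .id argument
test_dat_split <- dat %>%
  split(.$strat) %>%
  map_df(function(d) {
    res <- Hmisc::wtd.Ecdf(d$var)   # assumed to be Hmisc's wtd.Ecdf
    data.frame(X = res$x, Y = res$ecdf * 100)
  }, .id = "strat")
```

The result has the columns strat, X and Y, so it should work with the same ggplot() call as the do() version.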

2 Answers

2
votes

An alternative to splitting is nesting with group_nest(). After nesting, you call map() within mutate().

If you want to plot all groups together, you can then unnest() the result (unnest() is from tidyr).

I wrote an anonymous function in map() rather than using the tilde (formula) shorthand.

dat %>%
    group_nest(strat) %>%
    mutate(result = map(data, function(dat) {
        res = Hmisc::wtd.Ecdf(dat$var)
        data.frame(X = res$x, Y = res$ecdf*100)
        }) ) %>%
    tidyr::unnest(result)

# A tibble: 102 x 4
   strat data                   X     Y
   <chr> <list>             <dbl> <dbl>
 1 a     <tibble [33 x 1]> -1.88   0   
 2 a     <tibble [33 x 1]> -1.88   3.03
 3 a     <tibble [33 x 1]> -1.76   6.06
 4 a     <tibble [33 x 1]> -1.17   9.09
...

You can drop the data column if you don't need it, either with select() or by setting data = NULL within the mutate() call prior to unnesting.
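As a sketch of the data = NULL variant, assuming the same simulated data as in the question (test_dat_nest and the seed are names and values chosen here for illustration):

```r
library(dplyr)
library(purrr)

# simulate data as in the question (the seed is an arbitrary choice)
set.seed(1)
dat <- tibble(
  strat = rep(letters[1:3], each = 33),
  var   = rnorm(99, 0, 1))

test_dat_nest <- dat %>%
  group_nest(strat) %>%
  mutate(
    result = map(data, function(d) {
      res <- Hmisc::wtd.Ecdf(d$var)
      data.frame(X = res$x, Y = res$ecdf * 100)
    }),
    data = NULL   # drop the nested input once result has been computed
  ) %>%
  tidyr::unnest(result)
```

Because mutate() evaluates its arguments in order, result is built from data before data is removed.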

3
votes

Assuming that wtd.Ecdf is from Hmisc, its output is a named list, which can be converted to a two-column dataset with as_tibble. Then modify the 'ecdf' column with mutate, as in the do solution.

library(dplyr)
library(purrr)
library(Hmisc)
library(ggplot2)
test_dat_2 <- dat %>% 
                 group_split(strat) %>% 
                 map_df(~ c(strat = first(.x$strat), wtd.Ecdf(.x$var)) %>% 
                              as_tibble %>%
                              mutate(ecdf = ecdf * 100)) %>%
                 rename_at(2:3, ~ c("X", "Y"))

Now use that for plotting:

p <- ggplot(test_dat_2, aes(X, Y, colour = strat))
p + geom_step()


Another option is to do this after nesting:

test_dat_3 <- dat %>%
                group_by(strat) %>%
                tidyr::nest() %>% 
                mutate(out = map(data, ~ wtd.Ecdf(.x$var) %>% 
                          as_tibble)) %>% 
                select(-data) %>%
                tidyr::unnest(out) %>%   # name the list column to unnest explicitly
                rename_at(2:3, ~c("X", "Y"))