I'm looking to replace some of my R code that uses dplyr::do because this function will soon be deprecated. A lot of my work requires creating stratified CDF plots. When using dplyr::do, the variable I stratify on is carried into the resulting data frame as a column, which I can then use easily for plotting.
I have a partial replacement for dplyr::do using dplyr::group_split and purrr::map_df. However, the variable I pass to dplyr::group_split does not appear as a column in the resulting data frame. This makes plotting stratified data difficult. How do I ensure that the variable I pass to dplyr::group_split appears as a named column in the resulting data frame?
Here is some code creating the data I need to plot with dplyr::do:
library(Hmisc)   # provides wtd.Ecdf(); loaded first so it does not mask dplyr functions
library(dplyr)
library(purrr)
library(ggplot2)
# simulate data
dat <- tibble(
strat = rep(letters[1:3], each = 33),
var = rnorm(99, 0, 1))
# example 1 that works, but will be deprecated
test_dat_1 <- dat %>%
dplyr::select(strat, var) %>%
dplyr::group_by(strat) %>%
dplyr::do(data.frame(X = wtd.Ecdf(.[[2]])$x,
Y = wtd.Ecdf(.[[2]])$ecdf*100))
# this is the target plot
p <- ggplot(test_dat_1, aes(X, Y, colour = strat))
p + geom_step()
Here is my attempt to create the same data with the newer dplyr and purrr functions. Its limitation is that the variable I stratify on is not included in the final data frame, which makes plotting the stratified data cumbersome:
# replacement for 'do'
test_dat_2 <- dat %>%
group_split(strat) %>%
map_df(~wtd.Ecdf(.x$var),
tibble::enframe(name = "X", value = "Y"))
group_nest(). Since split() names the output list, you can use that instead of group_split(). A simple example: dat %>% split(.$strat) %>% map_df(~data.frame(X = mean(.x$var)), .id = "strat")
– aosmith
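Applied to the ECDF case from the question, the suggestion in the comment above might look like the sketch below. A few assumptions: ecdf_df() is a hypothetical helper that is not in the original post, base R's ecdf() stands in for Hmisc::wtd.Ecdf() so the example runs without Hmisc, and set.seed() is added only for reproducibility. Swapping wtd.Ecdf() back in should follow the same pattern.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1)  # assumption: added only so the sketch is reproducible

# simulate data, as in the question
dat <- tibble(
  strat = rep(letters[1:3], each = 33),
  var   = rnorm(99, 0, 1)
)

# Hypothetical helper (not from the original post): unweighted ECDF of x
# as a data frame, scaled to 0-100. Base ecdf() stands in for wtd.Ecdf().
ecdf_df <- function(x) {
  xs <- sort(unique(x))
  data.frame(X = xs, Y = ecdf(x)(xs) * 100)
}

test_dat_3 <- dat %>%
  split(.$strat) %>%                        # named list with elements a, b, c
  map_df(~ ecdf_df(.x$var), .id = "strat")  # list names become the strat column

# same plot call as for the dplyr::do version
p <- ggplot(test_dat_3, aes(X, Y, colour = strat)) + geom_step()
p
```

The key difference from group_split() is that split() returns a named list, so the .id argument of map_df() has names to write into the new strat column.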