4
votes

I have this situation where I need a different summary function based on a condition. For example, using iris, say for some reason I wanted the sum of the petal width if the species was setosa, otherwise I wanted the mean of the petal width.

Naively, I wrote this using case_when, which does not work:

iris <- tibble::as_tibble(iris)

 iris %>% 
  group_by(Species) %>% 
  summarise(pwz = case_when(
    Species == "setosa" ~ sum(Petal.Width, na.rm = TRUE),
    TRUE                ~ mean(Petal.Width, na.rm = TRUE)))

Error in summarise_impl(.data, dots) : Column pwz must be length 1 (a summary value), not 50

I eventually found something like this, summarizing using each method, and then in a mutate picking which one I actually wanted:

iris %>% 
  group_by(Species) %>% 
  summarise(pws = sum(Petal.Width, na.rm = TRUE),
            pwm = mean(Petal.Width, na.rm = TRUE)) %>% 
  mutate(pwz = case_when(
    Species == "setosa" ~ pws,
    TRUE                ~ pwm)) %>% 
  select(-pws, -pwm)

But that seems more than a bit awkward with creating all these summarized values and only picking one at the end, especially when my real case_when is a lot more complicated. Can I not use case_when inside of summarise? Do I have my syntax wrong? Any help is appreciated!

Edit: I suppose I should have pointed out that I have multiple conditions/functions (just assume I've got, depending on the variable, some that need mean, sum, max, min, or other summary).

5
Sounds like the real problem might be that your data isn't in a "tidy" format. It's very odd to perform different summarizations of values in the same column depending on other column values. You cannot use case_when like that because it's a vectorized function so it does not collapse (you'd only be taking the sum/mean of one value at a time).MrFlick
It's in long format - environmental data observations for different variables over time (daily), by district. I'm creating weekly summaries, but, for instance, with rainfall you need to add to get the total rainfall during that week, but for temperature, take the average during the week. So it is probably weird, but it makes sense with this dataset. That's useful information about it being a non-collapsing vectorized function - that tells me why my first attempt didn't work, thank you.michdn

5 Answers

6
votes

This is pretty easy with data.table

library(data.table)
iris2 <- as.data.table(iris)

iris2[, if(Species == 'setosa') sum(Petal.Width) 
        else mean(Petal.Width)
      , by = Species]

More concisely, but maybe not as clear

iris2[, ifelse(Species == 'setosa', sum, mean)(Petal.Width)
      , by = Species]

With dplyr you can do

iris %>% 
  group_by(Species) %>% 
  summarise(pwz = if_else(first(Species == "setosa")
                          , sum(Petal.Width)
                          , mean(Petal.Width)))

Note:

I'm thinking it probably makes more sense to "spread" your data with tidyr::spread so that each day has a column for temperature, rainfall, etc. Then you can use summarise in the usual way.

1
votes
data(iris)
library(dplyr)

sum_species <- c('setosa')

iris %>% 
   group_by(Species) %>% 
   summarise(pwz_sum = sum(Petal.Width, na.rm=T), 
             pwz_mean= mean(Petal.Width, na.rm=T)) %>% 
   ungroup() %>% 
   mutate(pwz = if_else(Species %in% sum_species, pwz_sum, pwz_mean))
0
votes

You could always do something like this if you want to put everything in the summarise function. But it's no less complicated than your original workaround:

iris %>% 
  group_by(Species) %>% 
  summarise(pwz = 
    sum(Petal.Width, na.rm = TRUE)*
    (1/n()*mean(Species != "setosa") + 
     mean(Species == "setosa")))
0
votes

Why not calculate at the row level first, then summarize?

iris %>% group_by(Species) %>% mutate(pwz = case_when(
      Species == "setosa" ~ sum(Petal.Width, na.rm = TRUE),
      TRUE                ~ mean(Petal.Width, na.rm = TRUE))) %>% 
      summarize(pwz= first(pwz))

# A tibble: 3 x 2
     Species    pwz
      <fctr>  <dbl>
1     setosa 12.300
2 versicolor  1.326
3  virginica  2.026
0
votes

You could split your data.frame and then use map2_dfr to apply a different function on each part and stitch the results back together:

library(tidyverse) # purrr & dplyr
iris %>%
  arrange(Species=="setosa") %>%
  split(.,.$Species=="setosa") %>%
  map2_dfr(c(mean,sum),~.x %>% group_by(Species) %>% summarize_at("Petal.Width",.y))

# # A tibble: 3 x 2
# Species Petal.Width
#       <fctr>       <dbl>
# 1 versicolor       1.326
# 2  virginica       2.026
# 3     setosa      12.300