5
votes

I often want to perform tidyr::spread and dplyr::summarise in a "single step" to aggregate data by group. What I want is shown in expected. I can get expected by performing summarise and spread separately and combine the results with a dplyr::full_join but I'm looking for alternative approaches that avoid full_join. Bona fide single-step approaches are not necessary.

df <- data.frame(
        id = rep(letters[1], 2),
        val1 = c(10, 20),
        val2 = c(100, 200),
        key = c("A", "B"),
        value = c(1, 2))

library(tidyverse)
result1 <- df %>%
              group_by(id) %>%
              summarise(
                val1 = min(val1),
                val2 = max(val2)
              )
# A tibble: 1 x 3
  # id      val1  val2
  # <fctr> <dbl> <dbl>
# 1 a       10.0   200

result2 <- df %>%
              select(id, key, value) %>%
              group_by(id) %>%
              spread(key, value)
# A tibble: 1 x 3
# Groups: id [1]
  # id         A     B
# * <fctr> <dbl> <dbl>
# 1 a       1.00  2.00

expected <- full_join(result1, result2, by="id")
# A tibble: 1 x 5
  # id      val1  val2     A     B
  # <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 a       10.0   200  1.00  2.00
3

3 Answers

5
votes

I suspect your data may have more edge cases that require some modification, but why don't you simply spread then summarise? You can specify the summary function separately per variable, so for A and B where you don't actually need to calculate anything (I'm assuming) you can just remove all the NA:

df %>%
  spread("key", "value") %>%
  group_by(id) %>%
  summarise(
    val1 = min(val1),
    val2 = max(val2),
    A = mean(A, na.rm = TRUE),
    B = mean(B, na.rm = TRUE)
    )
# A tibble: 1 x 5
  id     val1  val2     A     B
  <fct> <dbl> <dbl> <dbl> <dbl>
1 a      10.0   200  1.00  2.00
0
votes

Self-answer: Here's an approach that works with tidyr::nest but it seems "messy" and not much better

df %>%
  group_by(id) %>%
  nest() %>%
  mutate(
    min_vals = map(data, ~.x %>% summarise(min_val = min(val1), max_val = max(val2))),
    data = map(data, ~select(.x, key, value) %>% spread(key, value))
  ) %>%
  unnest()

# A tibble: 1 x 5
  # id         A     B min_val max_val
  # <fctr> <dbl> <dbl>   <dbl>   <dbl>
# 1 a       1.00  2.00    10.0     200
0
votes

Another approach using do:

res <- df %>%
  group_by(id) %>%
  summarise(
    val1 = min(val1),
    val2 = max(val2),
    key = list(key),
    value = list(value)
  ) %>% group_by(id, val1, val2) %>%
  do( matrix(.$value[[1]], nrow=1) %>% as.data.frame %>% setNames(as.character(.$key[[1]])) )