3
votes

I need to summarize a grouped data frame (note: a dplyr solution is much appreciated but not mandatory), computing both a statistic on each group (simple) and the same statistic on the "other" groups.

Minimal example:

if (!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)

df <- data_frame(
    group = c('a', 'a', 'b', 'b', 'c', 'c'),
    value = c(1, 2, 3, 4, 5, 6)
)

res <- df %>%
    group_by(group) %>%
    summarize(
        median        = median(value)
#        median_other  = ... ??? ... # I need the median of all "other"
                                     # groups
#        median_before = ... ??? ... # I need the median of some other groups
                                 #    (e.g. those "before" in alphabetical
                                 #    order, but clearly any rule which is
                                 #    a "selection function" depending
                                 #    on the current group is fine)
    )

My expected result is the following:

group    median    median_other    median_before
  a        1.5         4.5               NA
  b        3.5         3.5               1.5
  c        5.5         2.5               2.5

I've searched Google for strings like "dplyr summarize excluding groups" and "dplyr summarize other than group", and I've searched the dplyr documentation, but I wasn't able to find a solution.

This question (How to summarize value not matching the group using dplyr) does not apply here because its answer works only for sum, i.e. it is a "function-specific" solution (and relies on a simple arithmetic function that does not take the variability within each group into account). What about more complex functions (e.g. mean, sd, or a user-defined function)? :-)

Thanks to all

PS: summarize() is just an example; the same question applies to mutate() or other dplyr functions that work on groups.

2
Can't you just use library(dplyr) instead of the first two lines? - Rich Scriven
If dplyr isn't installed on your system, library(dplyr) returns an error, so to be sure that anyone can run the code I had to write two lines of code anyway. I decided to use pacman instead, which is a very useful package in my opinion (because you can load, and install if needed, many packages at once with just those two lines of code). - Corrado

2 Answers

2
votes

Here's my solution:

res <- df %>%
  group_by(group) %>%
  summarise(med_group = median(value),
            med_other = (median(df$value[df$group != group]))) %>% 
  mutate(med_before = lag(med_group))

> res
Source: local data frame [3 x 4]

      group med_group med_other med_before
  (chr)     (dbl)     (dbl)      (dbl)
1     a       1.5       4.5         NA
2     b       3.5       3.5        1.5
3     c       5.5       2.5        3.5

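I was trying to come up with an all-dplyr solution, but base R subsetting works just fine: median(df$value[df$group != group]) returns the median of all observations that are not in the current group. Note that lag(med_group) returns the median of the immediately preceding group only, which is why med_before is 3.5 for group c rather than the 2.5 (the median over all earlier groups) in your expected result.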
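The same subsetting idea can probably be generalized to any summary function and any "selection rule". The following is only a sketch under some assumptions: `apply_to_groups` is a hypothetical helper (not part of dplyr), the column names `group` and `value` are hard-coded to match the example, and `first(group)` is used to get the current group's label as a single value instead of relying on vector recycling.

```r
library(dplyr)

df <- tibble(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(1, 2, 3, 4, 5, 6)
)

# Hypothetical helper: apply `f` to the values of all rows whose group
# satisfies keep(row_group, current_group); return NA if no row qualifies.
apply_to_groups <- function(data, current, f, keep) {
  x <- data$value[keep(data$group, current)]
  if (length(x) == 0) NA_real_ else f(x)
}

res <- df %>%
  group_by(group) %>%
  summarise(
    med        = median(value),
    # all rows belonging to a different group
    med_other  = apply_to_groups(df, first(group), stats::median, `!=`),
    # all rows whose group sorts before the current one
    med_before = apply_to_groups(df, first(group), stats::median, `<`)
  )

print(res)
```

Swapping `stats::median` for `mean`, `sd`, or a user-defined function should work the same way, and `keep` can be any two-argument predicate on group labels. Because the helper subsets the full data frame once per group, it may be slow when there are many groups.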

I hope this helps you solve your problem.

2
votes

I don't think it is possible in general to perform operations on other groups within summarise() (i.e. I think the other groups are not "visible" while a given group is being summarised). However, you can define your own functions and use them in mutate() to apply them to the summarised column. For your updated example you can use:

calc_med_other <- function(x) sapply(seq_along(x), function(i) median(x[-i]))
calc_med_before <- function(x) sapply(seq_along(x), function(i) ifelse(i == 1, NA, median(x[seq(i - 1)])))

df %>%
    group_by(group) %>%
    summarize(med = median(value)) %>%
    mutate(
        med_other = calc_med_other(med),
        med_before = calc_med_before(med)
    )
#   group   med med_other med_before
#   (chr) (dbl)     (dbl)      (dbl)
#1     a   1.5       4.5         NA
#2     b   3.5       3.5        1.5
#3     c   5.5       2.5        2.5
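One caveat worth checking: the helpers above operate on the per-group medians, so they return a median of medians. In the balanced example that happens to coincide with the median of the pooled values from the other groups, but with unequal group sizes the two can differ. The sketch below compares both on `df2`, a hypothetical unbalanced data set that is not from the question.

```r
library(dplyr)

# Hypothetical unbalanced data (group sizes 2, 1 and 3)
df2 <- tibble(
  group = c("a", "a", "b", "c", "c", "c"),
  value = c(1, 2, 3, 10, 20, 30)
)

# Median of the other groups' medians (same logic as calc_med_other above)
calc_med_other <- function(x) {
  vapply(seq_along(x), function(i) median(x[-i]), numeric(1))
}

cmp <- df2 %>%
  group_by(group) %>%
  summarise(med = median(value)) %>%
  mutate(
    med_of_medians = calc_med_other(med),
    # Median of the raw values of all rows outside the current group
    med_pooled = vapply(
      group,
      function(g) median(df2$value[df2$group != g]),
      numeric(1),
      USE.NAMES = FALSE
    )
  )

print(cmp)
```

Which of the two columns is the right one depends on what "median of the other groups" is meant to be; the expected result in the question is consistent with both.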