conditional average after filtering by another dataframe

Question

full = data.frame(group = c('a', 'a', 'a', 'a', 'a', 'b', 'c'), values = c(1, 2, 2, 3, 5, 3, 4), year = c(2001, 2002, 2003, 2002, 2003, 2003, 2002))
max = data.frame(group = c('a', 'b', 'c'), year = c(2002, 2003, 2002))
## my attempt: 
full = full %>% group_by(group) %>% mutate(mean = mean(values[year != max$year[match(full$group, max$group)]], na.rm = TRUE))

I'm expecting a data.frame where with a new column mean that excludes years that are present in max (with associated group). But this is the output:

 group values year     mean
1     a      1 2001 2.666667
2     a      2 2002 2.666667
3     a      2 2003 2.666667
4     a      3 2002 2.666667
5     a      5 2003 2.666667
6     b      3 2003 3.000000
7     c      4 2002      NaN

Why is there a mean for b (5th row)? How can I change it for the mean to reflect that filter properly? I imagine it has something with this warning:

Warning message: In year != max$year[match(full$group, max$group)] : longer object length is not a multiple of shorter object length

full$group is a vector from the full table, not from the group_by subset. Maybe .$group or just group instead. — Frank
But would that match the different "max" years for different groups? — Rafael
Yes, any column x within a group_by statement refers to only the subset associated with that group_by. res = full %>% group_by(group) %>% mutate(mean = mean(values[year != max$year[match(first(group), max$group)]], na.rm = TRUE)) seems to work. I'm not sure if the first() wrapper is necessary. — Frank

Frank Frank · Accepted Answer · 2017-09-08T18:57:13

For what it's worth, I think this is cleaner in data.table:

library(data.table)
setDT(full); setDT(max)

mDT = full[!max, on=.(group, year)][ 
 .(unique(full$group)), on=.(group), mean(values), by=.EACHI]

   group       V1
1:     a 2.666667
2:     b       NA
3:     c       NA

Then you can add this column to the main table like full[mDT, on=.(group), v := i.V1] if desired.

Analogous dplyr code...

mDF = full %>% 
  anti_join(max) %>% 
  right_join(distinct(full, group)) %>% 
  group_by(group) %>% 
  summarise(v = mean(values))

Joining, by = c("group", "year")
Joining, by = "group"
# A tibble: 3 x 2
   group        v
  <fctr>    <dbl>
1      a 2.666667
2      b       NA
3      c       NA

This can be similarly joined back to full with a left_join or something, but that seems like a not-great idea from a "tidy data" perspective, since these variables are defined at the group level.

conditional average after filtering by another dataframe

1 Answers