
Within each group (group_by(id)), I am trying to sum a variable over a selection of types, where the types have an order of preference. Example:

library(tidyverse)
df <- data.frame(id = c(rep(1, 6), 2, 2, 2, rep(3, 4), 4, 5),
                 types = c("1a", "1a", "2a", "3b", "4c", "7d",
                          "4c", "7d", "7d","4c", "5d", "6d", "6d","5d","7d"),
                 x = c(10, 15, 20, 15, 30, 40,
                       10, 10, 15, 10, 10, 10, 10, 10, 10),
                 y = c(1:15),
                 z = c(1:15)
)
df
#    id types  x  y  z
# 1   1    1a 10  1  1
# 2   1    1a 15  2  2
# 3   1    2a 20  3  3
# 4   1    3b 15  4  4
# 5   1    4c 30  5  5
# 6   1    7d 40  6  6
# 7   2    4c 10  7  7
# 8   2    7d 10  8  8
# 9   2    7d 15  9  9
# 10  3    4c 10 10 10
# 11  3    5d 10 11 11
# 12  3    6d 10 12 12
# 13  3    6d 10 13 13
# 14  4    5d 10 14 14
# 15  5    7d 10 15 15

I want to compute sum(x) based on type preferences, in this order:

preference_1st = c("1a", "2a", "3b")
preference_2nd = c("7d")
preference_3rd = c("4c", "5d", "6d")

This means that if an id contains any types in preference_1st, we sum those and ignore the other types. If there are none from preference_1st, we sum all types in preference_2nd and ignore the rest. Finally, if there are only types from preference_3rd, we sum these. So for id = 1, we want to ignore types 4c and 7d. (I also want more straightforward calculations on other variables, y and z in this example.)

Desired output:

desired
  id sumtest ymean zmean
1  1      60   3.5   3.5
2  2      25   8.0   8.0
3  3      40  11.5  11.5
4  4      10  14.0  14.0
5  5      10  15.0  15.0

I think one option would be to use mutate and case_when to create some sort of order variable, but is there a better way with if statements? The following is close but doesn't distinguish between the preferences properly:

df %>%
  group_by(id) %>%
  summarise(sumtest = if (any(types %in% preference_1st)) {
    sum(x)
  } else if (any(!types %in% preference_1st) & any(types %in% preference_2nd)) {
    sum(x)
  } else {
    sum(x)
  },
            ymean = mean(y),
            zmean = mean(z))
#      id sumtest ymean zmean
#   <dbl>   <dbl> <dbl> <dbl>
# 1     1     130   3.5   3.5
# 2     2      35   8     8  
# 3     3      40  11.5  11.5
# 4     4      10  14    14  
# 5     5      10  15    15  

I'm open to other approaches too. Any suggestions?

thanks


3 Answers

1
votes

Here's a dplyr solution:

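df %>%
  group_by(id) %>%
  # Group means use all rows, so compute them before filtering
  mutate(ymean = mean(y), zmean = mean(z),
         # Preference rank per row: 1 = most preferred, 3 = least
         pref = 3 * (types %in% preference_3rd) +
                2 * (types %in% preference_2nd) +
                1 * (types %in% preference_1st)) %>%
  # Keep only the rows in the best preference level present for each id
  filter(pref == min(pref)) %>%
  summarise(sumtest = sum(x), ymean = first(ymean), zmean = first(zmean))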
#> # A tibble: 5 x 4
#>      id sumtest ymean zmean
#>   <dbl>   <dbl> <dbl> <dbl>
#> 1     1      60   3.5   3.5
#> 2     2      25   8     8  
#> 3     3      40  11.5  11.5
#> 4     4      10  14    14  
#> 5     5      10  15    15 
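
The same ranking idea can also be written with a lookup vector, which may be easier to extend if more preference levels are added. This is only a sketch: tier_of is a name I made up for illustration, and it assumes each type belongs to at most one preference list.

```r
library(dplyr)

df <- data.frame(
  id = c(rep(1, 6), 2, 2, 2, rep(3, 4), 4, 5),
  types = c("1a", "1a", "2a", "3b", "4c", "7d",
            "4c", "7d", "7d", "4c", "5d", "6d", "6d", "5d", "7d"),
  x = c(10, 15, 20, 15, 30, 40, 10, 10, 15, 10, 10, 10, 10, 10, 10),
  y = 1:15,
  z = 1:15
)

# Preference tiers, most preferred first
pref <- list(c("1a", "2a", "3b"), c("7d"), c("4c", "5d", "6d"))

# Hypothetical lookup: type -> tier number (1 = most preferred)
tier_of <- setNames(rep(seq_along(pref), lengths(pref)), unlist(pref))

res <- df %>%
  # as.character() guards against `types` being a factor
  mutate(tier = unname(tier_of[as.character(types)])) %>%
  group_by(id) %>%
  # which() drops rows whose type is in no tier (tier is NA there)
  summarise(sumtest = sum(x[which(tier == min(tier, na.rm = TRUE))]),
            ymean = mean(y),
            zmean = mean(z))
```

Types that appear in no preference list get an NA tier and are left out of the sum. An id that has only such types would need separate handling.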
1
votes

Use purrr's reduce with anti_join to filter the data iteratively, one preference level at a time.

pref <- list(c("1a", "2a", "3b"), c("7d"), c("4c", "5d", "6d"))

pref %>%
  map(~ df %>% filter(types %in% .x)) %>%
  reduce(~ anti_join(.y, .x, by = "id") %>% bind_rows(.x, .)) %>%
  group_by(id) %>%
  summarise(sumtest = sum(x)) %>%
  left_join(df %>% group_by(id) %>% summarise(ymean = mean(y), zmean = mean(z)), by = "id")

# # A tibble: 5 x 4
#      id sumtest ymean zmean
#   <dbl>   <dbl> <dbl> <dbl>
# 1     1      60   3.5   3.5
# 2     2      25   8     8  
# 3     3      40  11.5  11.5
# 4     4      10  14    14  
# 5     5      10  15    15   
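
In case the reduce step is hard to follow, here is roughly what it does when unrolled by hand for the three preference levels. The intermediate names (first, second, third, step1, kept) are only for illustration.

```r
library(dplyr)

df <- data.frame(
  id = c(rep(1, 6), 2, 2, 2, rep(3, 4), 4, 5),
  types = c("1a", "1a", "2a", "3b", "4c", "7d",
            "4c", "7d", "7d", "4c", "5d", "6d", "6d", "5d", "7d"),
  x = c(10, 15, 20, 15, 30, 40, 10, 10, 15, 10, 10, 10, 10, 10, 10),
  y = 1:15,
  z = 1:15
)

# One data frame per preference level
first  <- filter(df, types %in% c("1a", "2a", "3b"))
second <- filter(df, types %in% c("7d"))
third  <- filter(df, types %in% c("4c", "5d", "6d"))

# Add 2nd-preference rows only for ids that have no 1st-preference rows
step1 <- bind_rows(first, anti_join(second, first, by = "id"))

# Add 3rd-preference rows only for ids not covered so far
kept <- bind_rows(step1, anti_join(third, step1, by = "id"))

sums <- kept %>%
  group_by(id) %>%
  summarise(sumtest = sum(x))
```

Each anti_join drops the rows of the next level whose id has already been picked up by a more preferred level, so every id ends up with rows from exactly one level.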
0
votes

In my original attempt I forgot to subset x inside sum() in each branch of the if statement. Here is the corrected version, although I prefer the solutions above:

df %>%
  group_by(id) %>%
  summarise(sumtest = if (any(types %in% preference_1st)) {
    # At least one 1st-preference type: sum only those rows
    sum(x[types %in% preference_1st])
  } else if (any(types %in% preference_2nd)) {
    # No 1st-preference types, but some 2nd-preference ones
    sum(x[types %in% preference_2nd])
  } else {
    # Only 3rd-preference types remain
    sum(x[types %in% preference_3rd])
  },
  ymean = mean(y),
  zmean = mean(z))
#      id sumtest ymean zmean
#   <dbl>   <dbl> <dbl> <dbl>
# 1     1      60   3.5   3.5
# 2     2      25   8     8  
# 3     3      40  11.5  11.5
# 4     4      10  14    14  
# 5     5      10  15    15
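
If the number of preference levels grows, the if/else chain could be folded into a small helper that loops over the levels. This is just a sketch under that assumption: sum_first_match is a made-up name, and returning NA when no level matches is my own choice.

```r
library(dplyr)

df <- data.frame(
  id = c(rep(1, 6), 2, 2, 2, rep(3, 4), 4, 5),
  types = c("1a", "1a", "2a", "3b", "4c", "7d",
            "4c", "7d", "7d", "4c", "5d", "6d", "6d", "5d", "7d"),
  x = c(10, 15, 20, 15, 30, 40, 10, 10, 15, 10, 10, 10, 10, 10, 10),
  y = 1:15,
  z = 1:15
)

# Preference levels, most preferred first
prefs <- list(c("1a", "2a", "3b"), c("7d"), c("4c", "5d", "6d"))

# Hypothetical helper: sum x over the rows of the first level found in types
sum_first_match <- function(x, types, prefs) {
  for (p in prefs) {
    hit <- types %in% p
    if (any(hit)) return(sum(x[hit]))
  }
  NA_real_  # none of the levels occur in this group
}

res <- df %>%
  group_by(id) %>%
  summarise(sumtest = sum_first_match(x, types, prefs),
            ymean = mean(y),
            zmean = mean(z))
```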