How to sum based on three different conditions in R

Question

The following is my data.

gcode code year   P  Q
1      101  2000  1  3
1      101  2001  2  4
1      102  2000  1  1
1      102  2001  4  5
1      102  2002  2  6
1      102  2003  6  5
1      103  1999  6  1
1      103  2000  4  2
1      103  2001  2  1
2      104  2000  1  3
2      104  2001  2  4
2      105  2001  4  5
2      105  2002  2  6
2      105  2003  6  5
2      105  2004  6  1
2      106  2000  4  2
2      106  2001  2  1

gcode 1 has three different codes: 101, 102 and 103. The years that all three codes have in common are 2000 and 2001. I want to sum P and Q across the codes for each of these common years and drop the rows for every other year. I want to do the same for gcode 2.

How can I get a result like this?

gcode  year   P       Q
1      2000   1+1+4   3+1+2
1      2001   2+4+2   4+5+1
2      2001   2+4+2   4+5+1

Please delete the first row "1 2000 5 5". gcode=1 doesn't have data in 2000, because code=102 doesn't have data in 2000. — XUN ZHANG
Yes, thank you very much! Do you know how to do it quickly in R? — XUN ZHANG
Please edit your question to what you would accurately expect. It's confusing otherwise. — Phil
Sorry, it's my first time asking a question. I made a stupid mistake, sorry about that. Now I guess it is clear: for gcode=1, code=101, 102 and 103 all have data in 2001, and the same holds for gcode=2. — XUN ZHANG
Sorry to all of you, I am totally new here. I have now made the final change to my input and output. Thank you very much for your help! — XUN ZHANG

Ronak Shah · Accepted Answer · 2019-12-23T05:40:37

We can split the data by gcode, keep only the rows whose year is present for every code within that gcode, and then aggregate P and Q by gcode and year.

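# For each gcode: keep only the years shared by every code, then sum P and Q.
do.call(rbind, lapply(split(df, df$gcode), function(x) {
  # Years common to all codes within this gcode
  common_years <- Reduce(intersect, split(x$year, x$code))
  aggregate(cbind(P, Q) ~ gcode + year, subset(x, year %in% common_years), sum)
}))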

#    gcode year P  Q
#1.1     1 2000 6  6
#1.2     1 2001 8 10
#2       2 2001 8 10
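
To make the "common year" step concrete, here is a minimal sketch of what Reduce(intersect, ...) does for gcode 1 on its own. The list years_by_code is a hypothetical stand-in for split(x$year, x$code), with the values copied by hand from the question's data.

```r
# Illustrative only: years recorded for each code within gcode 1
# (values copied from the question's data; the list name is made up).
years_by_code <- list(
  `101` = c(2000L, 2001L),
  `102` = c(2000L, 2001L, 2002L, 2003L),
  `103` = c(1999L, 2000L, 2001L)
)

# Reduce() applies intersect() pairwise, left to right, so only years
# that appear in every vector survive.
common_years <- Reduce(intersect, years_by_code)
print(common_years)
# [1] 2000 2001
```

Rows whose year is not in common_years are then dropped before aggregating.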

With dplyr, the same logic can be written as:

library(dplyr)
df %>%
  group_split(gcode) %>%
  purrr::map_df(. %>% 
                 group_by(year) %>% 
                 filter(n_distinct(code) == n_distinct(.$code)) %>% 
                 group_by(gcode, year) %>%
                 summarise_at(vars(P:Q), sum))
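
As a possible variation (not part of the original answer), the same filter can be sketched as a single grouped pipeline without group_split() or purrr. This assumes dplyr 1.0.0 or later for across() and the .groups argument. The helper column n_codes and the result name res are made up for illustration, and df is rebuilt here only so the block runs on its own.

```r
library(dplyr)

# Same values as the data given in this answer, rebuilt so the sketch is self-contained.
df <- data.frame(
  gcode = rep(1:2, times = c(9, 8)),
  code  = c(101, 101, 102, 102, 102, 102, 103, 103, 103,
            104, 104, 105, 105, 105, 105, 106, 106),
  year  = c(2000, 2001, 2000, 2001, 2002, 2003, 1999, 2000, 2001,
            2000, 2001, 2001, 2002, 2003, 2004, 2000, 2001),
  P     = c(1, 2, 1, 4, 2, 6, 6, 4, 2, 1, 2, 4, 2, 6, 6, 4, 2),
  Q     = c(3, 4, 1, 5, 6, 5, 1, 2, 1, 3, 4, 5, 6, 5, 1, 2, 1)
)

res <- df %>%
  group_by(gcode) %>%
  # n_codes (hypothetical helper): number of distinct codes in this gcode
  mutate(n_codes = n_distinct(code)) %>%
  group_by(gcode, year) %>%
  # keep a year only if every code of the gcode has a row for it
  filter(n_distinct(code) == first(n_codes)) %>%
  summarise(across(c(P, Q), sum), .groups = "drop")

print(res)
```

On the sample data this should leave three rows (gcode 1 for 2000 and 2001, gcode 2 for 2001), matching the expected result in the question.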

data

df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L, 
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L, 
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L, 
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L, 
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L, 
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame", 
row.names = c(NA, -17L))
