1
votes

The following is my data.

gcode code year   P  Q
1      101  2000  1  3
1      101  2001  2  4
1      102  2000  1  1
1      102  2001  4  5
1      102  2002  2  6
1      102  2003  6  5
1      103  1999  6  1
1      103  2000  4  2
1      103  2001  2  1
2      104  2000  1  3
2      104  2001  2  4
2      105  2001  4  5
2      105  2002  2  6
2      105  2003  6  5
2      105  2004  6  1
2      106  2000  4  2
2      106  2001  2  1

gcode 1 has 3 different codes 101, 102 and 103. They all have the same year (2000 and 2001). I want to sum up P and Q for these years. Otherwise, I want to delete the irrelevant data. I want to do the same for gcode 2 as well.

How can I get the result like this?

gcode  year   P       Q
1      2000   1+1+4   3+1+2
1      2001   2+4+2   4+5+1
2      2001   2+4+2   4+5+1
3
please delete the first row "1 2000 5 5",gcode=1 don't have the data in 2000,because code=102 don't have data in 2000 - XUN ZHANG
Yes,thank you very much! Do you know how to make it quickly in R? - XUN ZHANG
Please edit your question to what you would accurately expect. It's confusing otherwise. - Phil
sorry, it's my first time to ask a question. I make a stupid mistake, sorry about that. Now, i guess it is clear. For gcode=1,code=101,102,103 all have data in 2001;gcode =2, is also the same - XUN ZHANG
sorry for all of you guys, i am totally new here. Now i make the last change to my input and output. Thank you very much for your help! - XUN ZHANG

3 Answers

4
votes

We can split the data based on gcode subset the data based on common year which is present in all the code and aggregate the data by gcode and year.

do.call(rbind, lapply(split(df, df$gcode), function(x) {
      aggregate(cbind(P, Q)~gcode+year, 
               subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))

#    gcode year P  Q
#1.1     1 2000 6  6
#1.2     1 2001 8 10
#2       2 2001 8 10

Using dplyr with similar logic we can do

library(dplyr)
df %>%
  group_split(gcode) %>%
  purrr::map_df(. %>% 
                 group_by(year) %>% 
                 filter(n_distinct(code) == n_distinct(.$code)) %>% 
                 group_by(gcode, year) %>%
                 summarise_at(vars(P:Q), sum))

data

df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L, 
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L, 
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L, 
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L, 
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L, 
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame", 
row.names = c(NA, -17L))
4
votes

An option using data.table package:

years <- DT[, {
    m <- min(year)
    ty <- tabulate(year-m)
    .(year=which(ty==uniqueN(code)) + m)
}, gcode]

DT[years, on=.(gcode, year),
    by=.EACHI, .(P=sum(P), Q=sum(Q))]

output:

   gcode year P  Q
1:     1 2000 6  6
2:     1 2001 8 10
3:     2 2001 8 10

data:

library(data.table)
DT <- fread("gcode code year   P  Q
1      101  2000  1  3
1      101  2001  2  4
1      102  2000  1  1
1      102  2001  4  5
1      102  2002  2  6
1      102  2003  6  5
1      103  1999  6  1
1      103  2000  4  2
1      103  2001  2  1
2      104  2000  1  3
2      104  2001  2  4
2      105  2001  4  5
2      105  2002  2  6
2      105  2003  6  5
2      105  2004  6  1
2      106  2000  4  2
2      106  2001  2  1")
3
votes

I came up with the following solution. First, I counted how many times each year appear for each gcode. I also counted how many unique codes exist for each gcode. Then, join the two results using left_join(). Then, I identified rows that have same values in n_year and n_code. Then, I joined the original data frame, which is called mydf. Then, I defined groups by gcode and year, and summed up P and Q for each group.

library(dplyr)

left_join(count(mydf, gcode, year, name = "n_year"),
          group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>% 
filter(n_year == n_code) %>% 
left_join(mydf, by = c("gcode", "year")) %>% 
group_by(gcode, year) %>% 
summarize_at(vars(P:Q),
             .funs = list(~sum(.)))

#  gcode  year     P     Q
#  <int> <int> <int> <int>
#1     1  2000     6     6
#2     1  2001     8    10
#3     2  2001     8    10

Another idea

I was reviewing this question later and came up with the following idea, which is much simpler. First, I defined groups by gcode and year. For each group, I counted how many data points existed using add_count(). Then, I defined groups again with gcode only. For each gcode group, I wanted to get rows that meet n == n_distinct(code). n is a column created by add_count(). If a number in n and a number returned by n_distinct() matches, that means that a year in that row exists among all code. Finally, I defined groups by gcode and year again and summed up values in P and Q.

group_by(mydf, gcode, year) %>% 
add_count() %>% 
group_by(gcode) %>% 
filter(n == n_distinct(code)) %>%
group_by(gcode, year) %>% 
summarize_at(vars(P:Q),
             .funs = list(~sum(.)))

# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
            .SD[check == uniqueN(code)], by = gcode][,
            lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]