1
votes

I'm currently working in data.table in R with the following data set:

    id   age_start   age_end    cases
    1    2           2          1000          
    1    3           3          500           
    1    4           4          300           
    1    2           4          1800            
    2    2           2          8000          
    2    3           3          200           
    2    4           4          100           
  • In the given data set I only want values of cases where the age_start == 2 and the age_end ==4.
  • In each ID where the age_start !=2 and the age_end !=4, I need to sum or aggregate the rows to create a group of age_start==2 and age_end ==4. In these cases I'd need to sum up the cases of age_start==2 & age_end==2, age_start==3 & age_end==3, as well as age_start==4 & age_end==4 into one new row of age_start==2 and age_end==4.
  • After these are summed up into one row, I want to drop the rows that I used to make the new age_start==2 and age_start==4 row (i.e. the age values 2-2, 3-3, and 4-4) as they are no longer needed

Ideally the data set would look like this when I finish these steps:

    id   age_start   age_end    cases
    1    2           4          1800          
    2    2           4          8300

Any suggestions on how to accomplish this in data.table are greatly appreciated!

2

2 Answers

3
votes

You can use an equi-join for the first bullet; and a non-equi join for the second:

m_equi = x[.(id = unique(id), age_dn = 2, age_up = 4), 
  on=.(id, age_start = age_dn, age_end = age_up),
  nomatch=0
]

m_nonequi = x[!m_equi, on=.(id)][.(id = unique(id), age_dn = 2, age_up = 4), 
  on=.(id, age_start >= age_dn, age_end <= age_up), 
  .(cases = sum(cases)), by=.EACHI
]

res = rbind(m_equi, m_nonequi)

   id age_start age_end cases
1:  1         2       4  1800
2:  2         2       4  8300

How it works:

  • x[i] uses values in i to look up rows and columns in x according to rules specified in on=.

  • nomatch=0 means unmatched rows of i in x[i] are dropped, so m_equi only ends up with id=1.

  • x[!m_equi, on=.(id)] is an anti-join that skips id=1 since we already matched it in the equi join.

  • by=.EACHI groups by each row of i in x[i] for the purpose of doing the aggregation.

An alternative would be to anti-join on rows with start 2 and end 4 so that all groups need to be aggregated (similar to @akrun's answer), though I guess that would be less efficient.

3
votes

We can specify the i with the logical condition, grouped by 'id', get the sum of 'cases' while adding 'age_start', 'age_end' as 2 and 4

library(data.table)
as.data.table(df1)[age_start != 2|age_end != 4, 
     .(age_start = 2, age_end = 4, cases = sum(cases)), id]
#  id age_start age_end cases
#1:  1         2       4  1800
#2:  2         2       4  8300

data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), age_start = c(2L, 
3L, 4L, 2L, 2L, 3L, 4L), age_end = c(2L, 3L, 4L, 4L, 2L, 3L, 
4L), cases = c(1000L, 500L, 300L, 1800L, 8000L, 200L, 100L)), 
  class = "data.frame", row.names = c(NA, 
-7L))