1
votes

I have daily rainfall data from 10 locations across 10 years:

set.seed(123)

df <- data.frame(loc.id = rep(1:10, each = 10 * 365),
                 years = rep(rep(2001:2010, each = 365), times = 10),
                 day = rep(rep(1:365, times = 10), times = 10),
                 rain = runif(10 * 10 * 365, min = 0, max = 35))

I have a separate data frame containing day indices between which I want to sum the rainfall in df:

df.ref <- data.frame(loc.id = rep(1:10, each = 10), 
                 years = rep(2001:2010,times = 10),
                 index1 = rep(250,times = 10*10),
                 index2 = sample(260:270, size = 10*10,replace = T),
                 index3 = sample(280:290, size = 10*10,replace = T),
                 index4 = sample(291:300, size= 10*10,replace = T))

df.ref

    loc.id years index1 index2 index3 index4
1:      1  2001    250    264    280    296
2:      1  2002    250    269    284    298
3:      1  2003    250    268    289    293
4:      1  2004    250    266    281    295
5:      1  2005    250    260    289    293

What I want is, for each row in df.ref, to use the index values and sum the rainfall in df from index1 to index2, index1 to index3, and index1 to index4. For example:

Using df.ref, for loc.id == 1 and years == 2001, sum the rainfall in df from day 250 to 264, 250 to 280, and 250 to 296 (as shown in df.ref). Similarly, for loc.id == 1 and year 2002, sum the rainfall from day 250 to 269, 250 to 284, and 250 to 298.
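To make the requirement concrete, the first of those three sums could be written in base R as (an illustrative sketch using the data frames defined above, not part of the intended fast solution):

```r
# Rainfall for loc.id 1 in year 2001, summed over days 250 to 264
with(subset(df, loc.id == 1 & years == 2001),
     sum(rain[day >= 250 & day <= 264]))
```

The goal is to compute this for every (loc.id, years) row of df.ref and every upper bound (index2, index3, index4) efficiently.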

I did this:

library(dplyr)  

ptm <- proc.time()

dat <- df.ref  %>% left_join(df)

index1.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index2) %>% summarise(sum.rain1  = sum(rain))

index2.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))

index3.cal <- dat %>% group_by(loc.id,years) %>% filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))

all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)

 proc.time() - ptm

user  system elapsed 
2.36    0.64    3.06 

I am looking to make my code faster since my actual df.ref is quite large. Could anyone advise me on how to make this quicker?

2
There are multiple records for the same id and year: subset(df.ref, loc.id == 1 & years == 2001). Should it be like that? – pogibas
"sum the rainfall in df from 250 to 264, 250 to 280, 250 to 296" Do you mean sum the rainfall in df from rows 250 to 264, 250 to 280, etc.? – Maurits Evers
Search for non-equi joins using data.table. – eddi
@PoGibas Sorry, there was an error in the data; I have fixed it. It should only have one record per id and year. – 89_Simple
@MauritsEvers Yes, that's what I meant. – 89_Simple

2 Answers

3
votes

A non-equi join from the data.table package can be both faster and more memory-efficient than dplyr::left_join (slide | video).

For each row in df.ref, find all the rain values in df whose day lies between index1 and index2, then sum rain by loc.id and years.

library(data.table)
setDT(df)
setDT(df.ref)

df1 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index2)
                , by = .EACHI][
                  , c("sum_1") := .(sum(rain)), by = .(loc.id, years)][
                  # remove all redundant columns (both day bound columns and rain)
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df2 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index3)
                , by = .EACHI][
                  , c("sum_2") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

df3 <- unique(df[df.ref
                , .(rain)
                , on = .(loc.id, years, day >= index1, day <= index4)
                , by = .EACHI][
                  , c("sum_3") := .(sum(rain)), by = .(loc.id, years)][
                  , day := NULL][
                  , day := NULL][
                  , rain := NULL])

Merge all three data.tables together:

df1[df2, on = .(loc.id, years)][
  df3, on = .(loc.id, years)]

     loc.id years     sum_1    sum_2    sum_3
  1:      1  1950 104159.11 222345.4 271587.1
  2:      1  1951 118689.90 257450.2 347624.3
  3:      1  1952  99262.27 212923.7 280877.6
  4:      1  1953  72435.50 192072.7 251593.6
  5:      1  1954 104021.19 242525.3 326463.4
  6:      1  1955  93436.32 232653.1 304921.4
  7:      1  1956  89122.79 190424.4 255535.0
  8:      1  1957 135658.11 262918.7 346361.4
  9:      1  1958  80064.18 220454.8 292966.4
 10:      1  1959 114231.19 273181.0 349489.2
 11:      2  1950  94360.69 238296.8 301751.8
 12:      2  1951  93845.50 195273.7 289686.0
 13:      2  1952 107692.53 245019.4 308093.7
 14:      2  1953  86650.14 257225.1 332674.1
 15:      2  1954 104085.83 238859.4 286350.7
 16:      2  1955 101602.16 223107.3 300958.4
 17:      2  1956  73912.77 198087.2 276590.1
 18:      2  1957 117780.86 228299.8 305348.5
 19:      2  1958  98625.45 220902.6 291583.7
 20:      2  1959 109851.38 266745.2 324246.8
 [ reached getOption("max.print") -- omitted 81 rows ]

Compare processing time and memory used:

> time_dplyr; time_datatable
   user  system elapsed 
   2.17    0.27    2.61 
   user  system elapsed 
   0.45    0.00    0.69 

  rowname      Class  MB
1     dat data.frame 508
2     df3 data.table  26
3     df2 data.table  20
4     df1 data.table   9

When testing for about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish.

The structure of the ~100-year test data:

'data.frame':   3650000 obs. of  4 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
 $ day   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ rain  : num  10.1 27.6 14.3 30.9 32.9 ...
'data.frame':   3650000 obs. of  6 variables:
 $ loc.id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ years : int  1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
 $ index1: num  250 250 250 250 250 250 250 250 250 250 ...
 $ index2: int  270 265 262 267 266 265 262 268 260 268 ...
 $ index3: int  290 287 286 289 281 285 286 285 284 283 ...
 $ index4: int  298 297 296 295 298 294 296 298 298 300 ...

> time_dplyr; time_datatable
   user  system elapsed
 95.010  33.704 128.722
   user  system elapsed
 26.175   3.147  29.312

  rowname      Class    MB
1     dat data.frame 50821
2     df3 data.table  2588
3     df2 data.table  2004
4     df1 data.table   888
5  df.ref data.table    97
6      df data.table    70

If I increased the number of years to 150, dplyr broke down even on an HPC cluster node with 256 GB of RAM:

Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
  negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted
2
votes

Here's a starting point that will be much faster. It should be trivial to figure out the rest.

library(data.table)
setDT(df)

df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
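One hypothetical way to finish this off (my own sketch, not part of the answer above): loop over the three upper bounds with lapply, compute one grouped sum per bound, rename the sum column, and join the three results on loc.id and years with Reduce.

```r
library(data.table)
setDT(df)
setDT(df.ref)

# One non-equi join per upper bound (index2, index3, index4)
sums <- lapply(2:4, function(k) {
  upper <- paste0("index", k)
  res <- df[df.ref,
            on = c("loc.id", "years", "day>=index1", paste0("day<=", upper)),
            .(rain = sum(rain)), by = .EACHI]
  # keep only the grouping columns and the sum, renamed per bound
  res <- res[, .(loc.id, years, rain)]
  setnames(res, "rain", paste0("sum_", k - 1))
  res
})

# Join the three per-bound tables into one result
all.sums <- Reduce(function(x, y) x[y, on = .(loc.id, years)], sums)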