0
votes

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).

set.seed(1)    
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])

In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).

For example, in this case the first entry of this new column would be

sum(daily$value[1:5])

and the second would be

sum(daily$value[2:8]) and so on.

I tried using the following code

intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))

This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.

I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.

Edit: I'm actually using dates from as.Date now instead of lubridatebut the basic question of how to refer to a different dataframe from within mutate still stands

1
wouldn't your first entry of this new column be sum(daily$value[1:6]) and for the second row of the new column sum(daily$value[2:9])? - rmuc8
It could go either way but one of my variables is an id and there is another row somewhere in the dataframe that has the same id and has a start date that is the same as the previous end date so I made the end exclusive to avoid double counting. - stat_student

1 Answers

4
votes
eps <- .Machine$double.eps
library(dplyr)
intervals %>% 
  rowwise() %>% 
  mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
#       start        end     nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481

In case if you find dplyr solution bit slow (basically due torowwise), you might want to use data.table for pure speed

library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
#        start        end       V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481