1
votes

I am trying to return the sum of values of one data frame between two dates in another data frame. The answers provided in Stack don't seem to work for my application. I have tried using data.table but to no avail, so here goes.

Create the Date Ranges

MeanRemaining <- seq(as.Date("2017-01-01"),as.Date("2017-02-28"),2)
MeanRemaining<-as.data.frame(cbind(MeanRemaining,lag(MeanRemaining)))
colnames(MeanRemaining)<-c("InspDate", "PrevInspDate")
MeanRemaining$InspDate<-as.Date(MeanRemaining$InspDate, origin = "1970/01/01")
MeanRemaining$PrevInspDate<-as.Date(MeanRemaining$PrevInspDate, origin = "1970/01/01")

Important to note that the date ranges are not actually fixed like they are above and could be any range of possibilities up to about a week apart.

Create the values to sum

DailyTonnes <- as.data.frame(cbind(as.data.frame(seq(as.Date
+ ("2016-12-01"),as.Date("2017-03-28"),1)),(replicate(1,sample(abs(rnorm(118))*1000,rep=TRUE)))))
colnames(DailyTonnes)<-c("date","Vol")

The Aim

I want to sum 'Vol' from 'DailyTonnes' between each of the date ranges in 'MeanRemaining' and return the total 'Vol' to the corresponding row in 'MeanRemaining'.

With some assistance of similar questions I've tried

library(data.table)
setDT(MeanRemaining)
setDT(DailyTonnes)

MeanRemaining[DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date <= PrevInspDate),
            by = .EACHI], TotalVol := V1, on = .(InspDate=date)]

However this returns NA values.

Any advice would be much appreciated.

1

1 Answers

1
votes

I believe your question had all the pieces you needed for the answer.

I polished your code a bit and change the last line (which was the only wrong one). The join in this last line was overly complicated and I don't think it could bring any memory/performance gain.

library(data.table)
# Create MeanRemaining
MeanRemaining <-
  data.table(InspDate = seq(as.Date("2017-01-01"), as.Date("2017-02-28"), 2))
# I changed lag by shift, I think it is clearer this way
MeanRemaining[, PrevInspDate := shift(InspDate, type = "lead", fill = 1000000L)]

# set seed for repetibility
set.seed(13)
# Create DailyTonnes, I changed the end date to generate empty intervals
DailyTonnes <- data.table(date = seq(as.Date("2016-12-01"),
                                     as.Date("2017-01-28"), 1),
                          Vol = sample(abs(rnorm(118)) * 1000, rep = TRUE))

# I changed the <= condition to <, I think it fits PrevInspDate better
# This should be your final result if I'm not wrong
SingleCase <-
  DailyTonnes[MeanRemaining, sum(Vol), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# SingleCase has two variables called date, this may be a small bug in data.table
print(names(SingleCase))

# change the names of the data.table to suit your needs
names(SingleCase) <- c("InspDate", "PrevInspDate", "TotalVol")

Edit: Recover multiple variables from table MeanRemaining

The case in which you retrieve multiple variables from MeanRemaining is quite tricky. It is easily solvable for a small amount of variables:

# Add variables to MeanRemaining
for (i in 1:100) {
  MeanRemaining[, paste0("extracol", i) := sample(.N)]
}

# Two variable case
smallmultiple <-
  DailyTonnes[MeanRemaining, list(TotalVol = sum(Vol),
                                  extracol1 = i.extracol1 ,
                                  extracol2 = i.extracol2), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# Correct date names
names(smallmultiple)[1:2] <- c("InspDate", "PrevInspDate")

When it comes to a lot of variable it becomes hard. There is this feature request in github that would solve yout problem but it is not available at the moment. This question faces a similar issue but cannot be used in your case.

The way around for a big amount of variables is:

# obtain names of variables to be kept in the later join
joinkeepcols <-
  setdiff(names(MeanRemaining),  c("InspDate", "PrevInspDate"))

# the "i" indicates the table to take the variables from
joinkeepcols2 <- paste0("i.", joinkeepcols)

# Prepare a expression for the data.table environment
keepcols <-
  paste(paste(joinkeepcols, joinkeepcols2, sep = " = "), collapse = ", ")

# Complete expression to be evaluated in data.table
evalexpression <- paste0("list(
                         TotalVol = sum(Vol),",
                         keepcols, ")")

# The magic comes here (you can assign it to MeanRemaining)
bigmultiple <-
  DailyTonnes[MeanRemaining, eval(parse(text = evalexpression)), on = .(date >= InspDate, date < PrevInspDate), by = .EACHI]

# Correct date names
names(bigmultiple)[1:2] <- c("InspDate", "PrevInspDate")