I'm a beginner with R and data.tables, but have read enough to be convinced of their speed / efficiency for large data sets. I've searched high and low and can't find the answer to what I would think is an easy problem.
Here's the question: Given two data tables DT1 and DT2
DT1 <- data.table(AA=c("A","B","C","A","B","C","A","B","C","A","B","C"),
BB=c(35,45,25,25,85,15,55,55,95,35,25,75)
)
DT2 <- data.table(CC=c("A","B","C","A","B","C"),
DD=c(10,20,30,40,50,60),
EE=c(5,5,10,10,15,20)
)
How do I add a new column (called NewCol) to DT1 that sums column EE from DT2 if multiple conditions are met given the existing values for each row of DT1:
1) Column AA = Column CC
2) Column DD >= Column BB
Said another way, sum EE where AA = CC AND DD >= BB.
The desired output is:
DT1_DesiredOutput <- data.table(AA=c("A","B","C","A","B","C","A","B","C","A","B","C"),
BB=c(35,45,25,25,85,15,55,55,95,35,25,75),
NewCol=c(10,15,30,10,0,30,0,0,0,10,15,0)
)
Join? Merge? Set? Something else? Guessing it's inefficient to loop through each row of DT1?
I'd also like to know how to make it scaleable, that is, adding several conditions (e.g., several greater than, less than, equal to values mixing "ANDs" and "ORs").
Thanks so much!
p.s., this is a piece of cake in Excel (cough suboptimal, I know), with an array formula such as
{=SUM(IF( ($F$7:$F$12=B7)*($G$7:$G$12>C7), $H$7:$H$12))}
copied in each cell of NewCol. I didn't want to clutter the post with the two Excel tables, and obviously Excel isn't great for lots of reasons, but if it's that easy in Excel, it must be comparably easy in R data.tables, right?