
I want to sum column values in a table DT2 based on row values of a different table DT1 and store the result in a third table DT3.

library(data.table)
DT1 <- data.table(x = seq(1, 100, by = 1),
                  y = round(runif(100, 50, 60)),
                  z = round(runif(100, 61, 70)))

DT2 <- data.table(a = seq(1, 1000, by = 1),
                  b = round(rnorm(1000, 10, 1)),
                  c = round(rnorm(1000, 20, 1)),
                  d = round(rnorm(1000, 30, 1)))

Initialize DT3:

DT3 <- DT1[,.(x)]
DT3[,':='(e = 0,
          f = 0,
          g = 0)]

Using a for loop to sum the values:

for(i in 1:nrow(DT1)){
  DT3[i, 2:4] <- DT2[DT1[i,y]:DT1[i,z], 
                     lapply(.SD, sum),
                     .SDcols = c("b", "c", "d")]
}

This works fine. The only issue is that it takes a lot of time as the number of rows and columns in DT1 and DT2 increases. My real DT1 has more than 100,000 rows, and the loop takes around 30 minutes. Is there a faster way to do this operation?


2 Answers


You can try this:

library(data.table)
DT1 <- data.table(x = seq(1, 100, by = 1),
                  y = round(runif(100, 50, 60)),
                  z = round(runif(100, 61, 70)))

DT2 <- data.table(a = seq(1, 1000, by = 1),
                  b = round(rnorm(1000, 10, 1)),
                  c = round(rnorm(1000, 20, 1)),
                  d = round(rnorm(1000, 30, 1)))

matrix(DT1[, colSums(DT2[, -"a"][as.numeric(y:z)]), 1:nrow(DT1)]$V1,
       ncol = 3, byrow = TRUE)

  #      [,1] [,2] [,3]
  # [1,]  114  223  324
  # [2,]   73  141  208
  # [3,]  112  222  323
  # [4,]  151  301  443
  # [5,]  133  263  382
  # ...

This iterates over the rows of DT1 (grouping by 1:nrow(DT1)), builds the row-index vector y:z (as.numeric(y:z)), subsets the non-a columns of DT2 with it, and takes colSums. The stacked sums in V1 are then reshaped into a matrix with one row per row of DT1.


In terms of a single iteration, your code is already efficient: you are using data.table instead of a data.frame, and the lapply(.SD, sum) statement is efficient as well.

Thus, I believe the major optimization should be focused on how you apply that sum statement.

1. Avoid duplicate calculations:

Check the result of table(duplicated(DT1[, c("y", "z")])) on your data. For the dummy data above, about 30% of the summing operations are repeated. If your data has a similarly high duplication ratio, I would suggest calculating the results for the unique (y, z) pairs first and then joining the results back onto the main data.table, as sketched below.
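A minimal sketch of that idea, assuming the DT1/DT2 from the question (key_sums is a hypothetical name for the intermediate table):

# Sum DT2's b, c, d over rows y..z once per unique (y, z) pair
key_sums <- unique(DT1[, .(y, z)])
key_sums[, c("e", "f", "g") := DT2[y:z, lapply(.SD, sum), .SDcols = c("b", "c", "d")],
         by = .(y, z)]

# Join the precomputed sums back onto DT1, preserving DT1's row order
DT3 <- key_sums[DT1, on = .(y, z)][, .(x, e, f, g)]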

2. Parallelize your loop: Your iterations are independent, which makes the loop parallelizable. Check the foreach package. Pay attention to the .combine parameter, as a well-chosen combine function can tremendously reduce the overhead of combining the results. A sketch follows below.
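A minimal sketch of the parallel version, assuming the foreach and doParallel packages and the DT1/DT2 from the question:

library(foreach)
library(doParallel)

cl <- makeCluster(parallel::detectCores() - 1L)
registerDoParallel(cl)

# Leaving .combine at its default returns a list, so the results can be
# bound once with rbindlist() instead of combined pairwise with rbind()
results <- foreach(i = seq_len(nrow(DT1)), .packages = "data.table") %dopar% {
  DT2[DT1[i, y]:DT1[i, z], lapply(.SD, sum), .SDcols = c("b", "c", "d")]
}

stopCluster(cl)

DT3 <- cbind(DT1[, .(x)], rbindlist(results))
setnames(DT3, c("x", "e", "f", "g"))

Note that with very many rows the per-iteration dispatch overhead can dominate, so chunking the row indices across workers may be needed to see a real speedup.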