1
votes

I have two data.tables:

k1 <- mtcars[1:4,1:6]
k11 <- as.data.table(k1)
k2 <- iris[1:3,1:2]
k22 <- as.data.table(k2)

I am trying to perform some column operation on the first data.table by iterating over rows of columns of second data.table

k3 <- lapply(1:nrow(k2),function(j){
  mpg=k1[,"mpg"]*k2[j,"Sepal.Width"] #get the new value of mpg equals to mpg*first row of second column of second data.frame
  cyl=k1[,"cyl"]*k2[j,"Sepal.Width"]
  a3=k1[,3:6] #all remaining columns over which no operations are done 
  a4<-cbind(mpg,cyl,a3) #cbind these and create a new dataframe for each row of second dataframe. There are three rows and hence there will be three final dataset

})
#rbind all these dataset and get the new dataset
k4<-do.call(rbind,k3)

head(k4)
                  mpg  cyl disp  hp drat    wt
Mazda RX4       73.50 30.6  160 110 3.90 2.620
Mazda RX4 Wag   73.50 30.6  160 110 3.90 2.875
Datsun 710      79.80 20.4  108  93 3.85 2.320
Hornet 4 Drive  74.90 30.6  258 110 3.08 3.215

While, the above solution works perfect, I was wondering:

  1. whether there is an efficiency gain using data.table (since there is no group_by operation here) with the first data frame of 30000 times 10 and the #second data frame of 60 times 4 (final dataset will have 30000 times 60 : 1.8 million rows).
  2. if there is an efficiency gain, how would one obtain the efficiency using data.table:

Following is my solution (similar to data.frame)

k5<-rbindlist(lapply(1:nrow(k2),function(j){

  k11[,`:=`(mpg=mpg*k22[j,Sepal.Width],cyl=cyl*k22[j,Sepal.Length])]

}))

 head(k5)
      mpg     cyl disp  hp drat    wt
1: 705.60 704.718  160 110 3.90 2.620
2: 705.60 704.718  160 110 3.90 2.875
3: 766.08 469.812  108  93 3.85 2.320
4: 719.04 704.718  258 110 3.08 3.215
5: 705.60 704.718  160 110 3.90 2.620
6: 705.60 704.718  160 110 3.90 2.875

As, you can see answers are different(I guess because of nature of copying of data.table).

1

1 Answers

1
votes

You can use set (which would be efficient as the overhead in [.data.table is avoided) and do the * after making the datasets to have the same number of rows,

library(data.table)
k1N <- k11[rep(1:.N,nrow(k22))]
k2N <- k22[rep(1:.N,each=nrow(k11))][, 2:1]

for(j in 1:2){
 set(k1N, i=NULL, j=j, value=k1N[[j]]*k2N[[j]])
}

k1N
#      mpg  cyl disp  hp drat    wt
# 1: 73.50 30.6  160 110 3.90 2.620
# 2: 73.50 30.6  160 110 3.90 2.875
# 3: 79.80 20.4  108  93 3.85 2.320
# 4: 74.90 30.6  258 110 3.08 3.215
# 5: 63.00 29.4  160 110 3.90 2.620
# 6: 63.00 29.4  160 110 3.90 2.875
# 7: 68.40 19.6  108  93 3.85 2.320
# 8: 64.20 29.4  258 110 3.08 3.215
# 9: 67.20 28.2  160 110 3.90 2.620
#10: 67.20 28.2  160 110 3.90 2.875
#11: 72.96 18.8  108  93 3.85 2.320
#12: 68.48 28.2  258 110 3.08 3.215