A few notes here:
- As it stands now, using `data.table` for your need could be overkill (though not necessarily) and you could probably avoid it.
- You are growing an object in a loop (`Column <- c(Column, x)`) - don't do that. In your case there is no need: just create a pre-allocated vector of zeroes and you can get rid of most of your function (see the sketch right after this list).
- There is absolutely no need to create `Column2` - it is just `z`, as R will automatically recycle it to fit the correct size (a short recycling demo also follows this list).
- No need to recalculate `nrow(addTable)` per row either; that could just be an additional function parameter.
- Your biggest overhead is calling `` data.table:::`[.data.table` `` per row - it is a very expensive function. The `:=` function itself has very little overhead here. If you replace `addTable[, First := First + Column] ; addTable[, Second := Second + Column2]` with just `addTable$First + Column ; addTable$Second + Column2`, the run time will be reduced from ~35 secs to ~2 secs. Another way to illustrate this is to replace the two lines with `set`, e.g. `set(addTable, j = "First", value = addTable[["First"]] + Column) ; set(addTable, j = "Second", value = addTable[["Second"]] + Column2)`, which basically shares its source code with `:=`. This also runs in ~2 secs.
- Finally, it is better to reduce the number of operations per row. You could try accumulating the result using `Reduce` instead of updating the actual data set per row.
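To see why growing objects in a loop hurts, here is a minimal sketch (my own illustration, not code from the question; the function names `grow` and `prealloc` are made up) contrasting the two approaches:

```r
# Growing re-allocates and copies the vector on every iteration (O(n^2) work),
# while pre-allocating a vector of zeroes writes into existing memory
grow <- function(n) {
  out <- NULL
  for (i in seq_len(n)) out <- c(out, i)  # copies the whole vector each time
  out
}
prealloc <- function(n) {
  out <- numeric(n)                       # vector of zeroes, allocated once
  for (i in seq_len(n)) out[i] <- i
  out
}
system.time(grow(5e4))      # takes seconds
system.time(prealloc(5e4))  # essentially instant
```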
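And a quick demo of the recycling point (toy values, for illustration only) - a scalar added to a vector is recycled to the vector's length, so building `Column2` with `rep` is redundant:

```r
z <- 5
numeric(3) + z           # 5 5 5
numeric(3) + rep(z, 3)   # identical result, with an unnecessary rep() call
```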
Let's see some examples.
Your original function timings:
```r
library(data.table)
dt <- data.table(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.table(First = 0, Second = 0, Term = c(1:50))

sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- NULL
  Column2 <- rep(z, length(Column))
  addTable[, First := First + Column]
  addTable[, Second := Second + Column2]
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
```
30 secs is pretty slow...
1- Let's try removing the `` data.table:::`[.data.table` `` overhead
```r
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- NULL
  Column2 <- rep(z, length(Column))
  addTable$First + Column
  addTable$Second + Column2
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
```
^ That was much faster but didn't update the actual data set.
2- Now let's try replacing it with `set`, which will have the same effect as `:=` but without the `` data.table:::`[.data.table` `` overhead
```r
sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- NULL
  Column2 <- rep(z, length(Column))
  set(addTable, j = "First", value = addTable[["First"]] + Column)
  set(addTable, j = "Second", value = addTable[["Second"]] + Column2)
}

system.time(mapply(sample_fun, dt$X, dt$Y, dt$Z))
#  user  system elapsed
#  2.96    0.00    2.96
```
^ Well, that was also much faster than 30 secs and had the exact same effect as `:=`
3- Let's try it without using `data.table` at all
```r
dt <- data.frame(X = c(1:100), Y = c(.5, .7, .3, .4), Z = c(1:50000))
addTable <- data.frame(First = 0, Second = 0, Term = c(1:50))

sample_fun <- function(x, y, z) {
  Column <- NULL
  while (x >= 1) {
    x <- x * y
    Column <- c(Column, x)
  }
  length(Column) <- nrow(addTable)
  Column[is.na(Column)] <- 0
  Column2 <- NULL
  Column2 <- rep(z, length(Column))
  return(list(Column, Column2))
}

system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z))
```
^ That's even faster
Now we can use `Reduce` combined with `accumulate = TRUE` in order to create those vectors
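(As a quick aside - toy values, my own illustration: with `accumulate = TRUE`, `Reduce` returns every intermediate result, so taking the last element yields the full sum.)

```r
Reduce(`+`, list(1, 2, 3), accumulate = TRUE)       # list(1, 3, 6)
Reduce(`+`, list(1, 2, 3), accumulate = TRUE)[[3]]  # 6 - the complete reduction
```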
```r
system.time(addTable$First <- Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]])
#  user  system elapsed
#  0.07    0.00    0.06
system.time(addTable$Second <- Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])
#  user  system elapsed
#  0.07    0.00    0.06
```
Well, everything combined is now under 2 seconds (instead of 30 with your original function).
4- A further improvement could be to fix the other elements in your function (as pointed out above); in other words, your function could be just
```r
sample_fun <- function(x, y, n) {
  Column <- numeric(n)
  i <- 1L
  while (x >= 1) {
    x <- x * y
    Column[i] <- x
    i <- i + 1L
  }
  return(Column)
}

system.time(res <- Map(sample_fun, dt$X, dt$Y, nrow(addTable)))
```
^ A twofold improvement in speed
Now, we didn't even bother creating `Column2` as we already have it in `dt$Z`. We also used `Map` instead of `mapply` as it will be easier for `Reduce` to work with a `list` than a `matrix` (a quick illustration of that difference follows).
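(Toy values, my own example - `Map` always returns a list, while `mapply` simplifies to an atomic vector or matrix when it can.)

```r
Map(`+`, 1:2, 3:4)      # list(4, 6) - a list, which Reduce iterates over directly
mapply(`+`, 1:2, 3:4)   # c(4, 6)   - simplified to an atomic vector
```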
The next step is similar to before:
```r
system.time(addTable$First <- Reduce(`+`, res, accumulate = TRUE)[[nrow(dt)]])
#  user  system elapsed
#  0.07    0.00    0.07
```
But we could improve this even further. Instead of using `Map`/`Reduce`, we could create a `matrix` using `mapply` and then run `matrixStats::rowCumsums` over it (which is implemented in C++ internally) in order to calculate `addTable$First`:
```r
system.time(res <- mapply(sample_fun, dt$X, dt$Y, nrow(addTable)))
system.time(addTable$First2 <- matrixStats::rowCumsums(res)[, nrow(dt)])
```
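(To see why the last column is taken - toy values, my own illustration: `rowCumsums` computes running sums along each row, so the final column holds the row totals.)

```r
m <- matrix(1:6, nrow = 2)  # rows: 1 3 5  and  2 4 6
matrixStats::rowCumsums(m)  # rows: 1 4 9  and  2 6 12 - last column = row sums
```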
While the final step is simply summing `dt$Z` - each call adds the scalar `z` to every element of `Second`, so every row ends up holding the same total:

```r
system.time(addTable$Second <- sum(dt$Z))
```
So eventually we went from ~30 secs to less than a second.
Some final notes
- As it seems like the main overhead remained in the function itself, you could also try rewriting it using Rcpp, as loops seem inevitable in this case (though the overhead is not that big, it seems). A sketch of that idea follows.
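Here is a minimal Rcpp sketch of that idea (my own illustration, not from the original post; the name `sample_fun_cpp` is made up, and it assumes the Rcpp package is installed):

```r
library(Rcpp)

# Same logic as sample_fun above, but the while loop runs in compiled C++
cppFunction('
NumericVector sample_fun_cpp(double x, double y, int n) {
  NumericVector column(n);      // pre-allocated, zero-filled
  int i = 0;
  while (x >= 1 && i < n) {
    x *= y;
    column[i] = x;
    i++;
  }
  return column;
}')

system.time(res <- mapply(sample_fun_cpp, dt$X, dt$Y, nrow(addTable)))
```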
Comments from the discussion under this answer:
- `mapply` is a per-row `for` loop. In other words, the slowdown is not directly related to the `:=` operator, but rather to how often and how expensive a function is being evaluated. Instead of removing the last rows entirely, you should have just removed the `:=` operators and tested just `addTable[, First + Column] ; addTable[, Second + Column]` - then you will notice that the `:=` overhead is not so big. Also, calling `[.data.table` twice is expensive; you could update both columns at once in each iteration. – David Arenburg
- What is `Column2` calculated for? You are not using it anywhere. – David Arenburg
- Remove the last two rows of the function and put `return(Column)` instead of them. Then run `` addTable[, First2 := Reduce(`+`, Map(sample_fun, dt$X, dt$Y, dt$Z), accumulate = TRUE)[[nrow(dt)]]] ``. This will take less than a second on your 50K rows. – David Arenburg
- Change `return(Column)` to `return(list(Column, Column2))` within the function. Then: `` system.time(res <- mapply(sample_fun, dt$X, dt$Y, dt$Z)) ; addTable[, `:=`(First2 = Reduce(`+`, res[1, ], accumulate = TRUE)[[nrow(dt)]], Second2 = Reduce(`+`, res[2, ], accumulate = TRUE)[[nrow(dt)]])] `` – David Arenburg