I am transitioning from data.frame to data.table in R for better performance. A large part of converting my code is taking the custom functions I ran with apply on a data.frame and running them in data.table instead.

Say I have a simple data table, dt1.

x  y  z
1  9  j
4  1  n
7  1  n

I am trying to calculate a new column in dt1 based on the values of x, y and z. I tried two ways. Both give the correct result, but the faster one emits a warning, so I want to make sure the warning is nothing serious before I use the faster version when converting my existing code.

(1) dt1[, a := {if ((x < 1) & (y > 3) & (z == "n")) {6} else {7}}]

(2) dt1[, a := {if ((x < 1) & (y > 3) & (z == "n")) {6} else {7}}, by = 1:nrow(dt1)]

Version 1 runs faster than version 2 but emits the warning "the condition has length > 1 and only the first element will be used", although the result looks correct. Version 2 is slightly slower but does not give that warning. I want to make sure version 1 will not give erratic results once I start writing more complicated functions.

Please treat this as a generic question: how do I run a user-defined function that reads several column values in a given row and calculates the new column's value for that row?

Thanks for your help.

1 Answer

If 'x', 'y', and 'z' are the columns of 'dt1', try either the vectorized ifelse

dt1[, a:=ifelse(x<1 & y >3 & z=='n', 6, 7)] 

Or create 'a' with 7, then assign 6 to 'a' based on the logical index.

dt1[, a := 7][x<1 & y >3 & z=='n', a:=6][]
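On the warning itself: `if` expects a single TRUE/FALSE, so with whole columns as input it looks only at the first row and the chosen value is then recycled to every row. The result in version (1) is therefore only right when all rows happen to agree with the first one. Here is a small base R sketch of that behaviour, using made-up vectors rather than the question's table. Note that R 4.2.0 and later turn this warning into an error, so the sketch emulates the old behaviour with `cond[1]` instead of triggering it.

```r
# Illustrative vectors (not the question's data)
x <- c(0, 1, 4)
y <- c(4, 9, 1)
z <- c("n", "j", "n")

# Logical vector with one element per row: TRUE FALSE FALSE
cond <- x < 1 & y > 3 & z == "n"

# What the old warning meant: only cond[1] is inspected, and the single
# resulting value is recycled to every row.
old_if_result <- rep(if (cond[1]) 6 else 7, length(cond))

# ifelse() evaluates the condition element by element.
vectorized_result <- ifelse(cond, 6, 7)

print(old_if_result)      # 6 6 6
print(vectorized_result)  # 6 7 7
```

So the warning is serious: it signals that rows 2 and onward were never checked.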

Using a function

getnewvariable <- function(v1, v2, v3){
   ifelse(v1 <1 & v2 >3 & v3=='n', 6, 7)
}

dt1[, a := getnewvariable(x, y, z)][]
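For the generic case where the user-defined function cannot easily be vectorized (it uses `if`, loops, or other scalar-only logic), one option is to let `mapply()` call it once per row. The sketch below is base R only, and `classify_row` is a hypothetical name standing in for whatever scalar function you have.

```r
# Hypothetical scalar function: written for one row's values at a time,
# so the scalar operators `&&` and `if` are appropriate here.
classify_row <- function(v1, v2, v3) {
  if (v1 < 1 && v2 > 3 && v3 == "n") 6 else 7
}

# Same values as the sample data in this answer
x <- c(0L, 1L, 4L, 7L, -2L)
y <- c(4L, 9L, 1L, 1L, 5L)
z <- c("n", "j", "n", "n", "n")

# mapply() walks the three vectors in parallel and calls classify_row
# once per position, then simplifies the results to a vector.
a <- mapply(classify_row, x, y, z)

print(a)  # 6 7 7 7 6
```

Inside data.table the same call should work as dt1[, a := mapply(classify_row, x, y, z)]. It is still one function call per row, so expect it to be slower than a truly vectorized function; use it only when vectorizing is not practical.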

data

df1 <- structure(list(x = c(0L, 1L, 4L, 7L, -2L), y = c(4L, 9L, 1L, 
1L, 5L), z = c("n", "j", "n", "n", "n")), .Names = c("x", "y", 
"z"), class = "data.frame", row.names = c(NA, -5L))

dt1 <- as.data.table(df1)
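As a rough end-to-end check on this sample data, the sketch below runs the vectorized version, the by-row version from the question, and `fifelse()` side by side. `fifelse()` is data.table's faster counterpart to `ifelse()` and is only available in reasonably recent data.table releases, so treat that line as an assumption about your installed version. The column names `a_byrow` and `a_fifelse` are made up for the comparison.

```r
library(data.table)

# Sample data from this answer, built directly as a data.table
dt1 <- data.table(
  x = c(0L, 1L, 4L, 7L, -2L),
  y = c(4L, 9L, 1L, 1L, 5L),
  z = c("n", "j", "n", "n", "n")
)

# Vectorized: one call covers all rows.
dt1[, a := ifelse(x < 1 & y > 3 & z == "n", 6, 7)]

# Row by row: each group is a single row, so `if` sees a length-1 condition.
dt1[, a_byrow := if (x < 1 & y > 3 & z == "n") 6 else 7,
    by = seq_len(nrow(dt1))]

# fifelse(): assumes a data.table version that provides it.
dt1[, a_fifelse := fifelse(x < 1 & y > 3 & z == "n", 6, 7)]

print(dt1)
```

All three columns should come out as 6, 7, 7, 7, 6. The by-row version gives the right answer but pays for one group per row, which is why it is the slowest of the three on large tables.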