3
votes

I have a data table similar to this (except it has 150 columns and about 5 million rows):

set.seed(1)
dt <- data.table(ID=1:10, Status=c(rep("OUT",2),rep("IN",2),"ON",rep("OUT",2),rep("IN",2),"ON"), 
             t1=round(rnorm(10),1), t2=round(rnorm(10),1), t3=round(rnorm(10),1), 
             t4=round(rnorm(10),1), t5=round(rnorm(10),1), t6=round(rnorm(10),1),
             t7=round(rnorm(10),1),t8=round(rnorm(10),1))

which outputs:

    ID Status   t1   t2   t3   t4   t5   t6   t7   t8
 1:  1    OUT -0.6  1.5  0.9  1.4 -0.2  0.4  2.4  0.5
 2:  2    OUT  0.2  0.4  0.8 -0.1 -0.3 -0.6  0.0 -0.7
 3:  3     IN -0.8 -0.6  0.1  0.4  0.7  0.3  0.7  0.6
 4:  4     IN  1.6 -2.2 -2.0 -0.1  0.6 -1.1  0.0 -0.9
 5:  5     ON  0.3  1.1  0.6 -1.4 -0.7  1.4 -0.7 -1.3
 6:  6    OUT -0.8  0.0 -0.1 -0.4 -0.7  2.0  0.2  0.3
 7:  7    OUT  0.5  0.0 -0.2 -0.4  0.4 -0.4 -1.8 -0.4
 8:  8     IN  0.7  0.9 -1.5 -0.1  0.8 -1.0  1.5  0.0
 9:  9     IN  0.6  0.8 -0.5  1.1 -0.1  0.6  0.2  0.1
10: 10     ON -0.3  0.6  0.4  0.8  0.9 -0.1  2.2 -0.6

Using data.table, I would like to add a new column (using :=) called Total that would contain the following:

For each row,

if Status=OUT, sum columns t1:t4 and t8

if Status=IN, sum columns t5,t6,t8

if Status=ON, sum columns t1:t3 and t6:t8

The final output should look like this:

    ID Status   t1   t2   t3   t4   t5   t6   t7   t8  Total
 1:  1    OUT -0.6  1.5  0.9  1.4 -0.2  0.4  2.4  0.5   3.7
 2:  2    OUT  0.2  0.4  0.8 -0.1 -0.3 -0.6  0.0 -0.7   0.6
 3:  3     IN -0.8 -0.6  0.1  0.4  0.7  0.3  0.7  0.6   1.6
 4:  4     IN  1.6 -2.2 -2.0 -0.1  0.6 -1.1  0.0 -0.9  -1.4
 5:  5     ON  0.3  1.1  0.6 -1.4 -0.7  1.4 -0.7 -1.3   1.4
 6:  6    OUT -0.8  0.0 -0.1 -0.4 -0.7  2.0  0.2  0.3  -1.0
 7:  7    OUT  0.5  0.0 -0.2 -0.4  0.4 -0.4 -1.8 -0.4  -0.5
 8:  8     IN  0.7  0.9 -1.5 -0.1  0.8 -1.0  1.5  0.0  -0.2
 9:  9     IN  0.6  0.8 -0.5  1.1 -0.1  0.6  0.2  0.1   0.6
10: 10     ON -0.3  0.6  0.4  0.8  0.9 -0.1  2.2 -0.6   2.2

I am fairly new to data.table (currently using version 1.9.6) and would like to try for a solution using efficient data.table syntax.

1
I guess if you only have 3 conditions you could just write a line for each condition, for instance, for "OUT", you could do dt[Status == "OUT", Total := rowSums(.SD), .SDcols = c(1:4, 8) + 2] OR dt[Status == "OUT", Total := Reduce(`+`, .SD), .SDcols = c(1:4, 8) + 2] and similarly for the other two conditions - David Arenburg
@DavidArenburg Thank you so much for your speedy reply. I would like to use column names as my real dataset is large. I tried using .SCcols = c(t1:t4, t8) + 2 but got an error "Error in eval(expr, envir, enclos) : object 't1' not found". How could I do this using column names? Also, used .SDcols = c(3:6, 10) + 2 (#as 3,6,10 are the column numbers) and got error "Error in [.data.table(dt, Status == "OUT", :=(Total, rowSums(.SD)), : .SDcols is numeric but out of bounds (or NA)". So I redid it using .SDcols = c(3:6, 10) (#without the +2 this time) and it worked. What is the +2 for? - FG7
You could write a really long data table expression to do this all in one go, but it may be cleaner and more readable to create 3 tables of your aggregation and then use rbind to put them together. - giraffehere
The +2 in order to skip the first two columns. You could either do c(3:6, 10) or c(1:3, 8) + 2- I was trying to make it simpler for you. Regarding column names, if you want to use the names instead of locations, you need to pass quoted column names if you using c, such as .SDcols = c(paste0("t", 1:4), "t8"). - David Arenburg
@DavidArenburg's is the best approach if there are 3 conditions. You might also consider keying on Status for some efficiency gain. - MichaelChirico

1 Answers

5
votes

I think doing it one by one, as suggested in comments, is perfectly fine, but you can also create a lookup table:

cond = data.table(Status = c("OUT", "IN", "ON"),
                  cols = Map(paste0, 't', list(c(1:4, 8), c(5,6,8), c(1:3, 6:8))))
#   Status              cols
#1:    OUT    t1,t2,t3,t4,t8
#2:     IN          t5,t6,t8
#3:     ON t1,t2,t3,t6,t7,t8

dt[cond, Total := Reduce(`+`, .SD[, cols[[1]], with = F]), on = 'Status', by = .EACHI]
#    ID Status   t1   t2   t3   t4   t5   t6   t7   t8 Total
# 1:  1    OUT -0.6  1.5  0.9  1.4 -0.2  0.4  2.4  0.5   3.7
# 2:  2    OUT  0.2  0.4  0.8 -0.1 -0.3 -0.6  0.0 -0.7   0.6
# 3:  3     IN -0.8 -0.6  0.1  0.4  0.7  0.3  0.7  0.6   1.6
# 4:  4     IN  1.6 -2.2 -2.0 -0.1  0.6 -1.1  0.0 -0.9  -1.4
# 5:  5     ON  0.3  1.1  0.6 -1.4 -0.7  1.4 -0.7 -1.3   1.4
# 6:  6    OUT -0.8  0.0 -0.1 -0.4 -0.7  2.0  0.2  0.3  -1.0
# 7:  7    OUT  0.5  0.0 -0.2 -0.4  0.4 -0.4 -1.8 -0.4  -0.5
# 8:  8     IN  0.7  0.9 -1.5 -0.1  0.8 -1.0  1.5  0.0  -0.2
# 9:  9     IN  0.6  0.8 -0.5  1.1 -0.1  0.6  0.2  0.1   0.6
#10: 10     ON -0.3  0.6  0.4  0.8  0.9 -0.1  2.2 -0.6   2.2