I have a data table similar to this (except it has 150 columns and about 5 million rows):
set.seed(1)
dt <- data.table(ID=1:10, Status=c(rep("OUT",2),rep("IN",2),"ON",rep("OUT",2),rep("IN",2),"ON"),
t1=round(rnorm(10),1), t2=round(rnorm(10),1), t3=round(rnorm(10),1),
t4=round(rnorm(10),1), t5=round(rnorm(10),1), t6=round(rnorm(10),1),
t7=round(rnorm(10),1),t8=round(rnorm(10),1))
which outputs:
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
Using data.table, I would like to add a new column (using :=) called Total that would contain the following:
For each row,
if Status=OUT, sum columns t1:t4 and t8
if Status=IN, sum columns t5,t6,t8
if Status=ON, sum columns t1:t3 and t6:t8
The final output should look like this:
ID Status t1 t2 t3 t4 t5 t6 t7 t8 Total
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5 3.7
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7 0.6
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6 1.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9 -1.4
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3 1.4
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3 -1.0
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4 -0.5
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0 -0.2
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1 0.6
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6 2.2
I am fairly new to data.table (currently using version 1.9.6) and would like to try for a solution using efficient data.table syntax.
"OUT", you could dodt[Status == "OUT", Total := rowSums(.SD), .SDcols = c(1:4, 8) + 2]ORdt[Status == "OUT", Total := Reduce(`+`, .SD), .SDcols = c(1:4, 8) + 2]and similarly for the other two conditions - David Arenburg[.data.table(dt, Status == "OUT",:=(Total, rowSums(.SD)), : .SDcols is numeric but out of bounds (or NA)". So I redid it using .SDcols = c(3:6, 10) (#without the +2 this time) and it worked. What is the +2 for? - FG7+2in order to skip the first two columns. You could either doc(3:6, 10)orc(1:3, 8) + 2- I was trying to make it simpler for you. Regarding column names, if you want to use the names instead of locations, you need to pass quoted column names if you usingc, such as.SDcols = c(paste0("t", 1:4), "t8"). - David ArenburgStatusfor some efficiency gain. - MichaelChirico