Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
- You said "does the below snippet... change all the columns in
DT
..."
The answer is no, and this is very important for data.table
. The object returned is a new data.table
, and all of the columns in DT
are exactly as they were before running the code.
- You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]
) returns a new data.table
and does not change DT
at all. One (incorrect) way to do this, which is done with data.frame
s in base
, is to overwrite the old data.table
with the new data.table
you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)]
.
This is wasteful because it involves creating copies of DT
which can be an efficiency killer when DT
is large. The correct data.table
approach to this problem is to update the columns by reference using`:=`
, e.g., DT[ , names(DT) := lapply(.SD, as.factor)]
, which creates no copies of your data. See data.table
's reference semantics vignette for more on this.
- You mentioned comparing efficiency of
lapply(.SD, sum)
to that of colSums
. sum
is internally optimized in data.table
(you can note this is true from the output of adding the verbose = TRUE
argument within []
); to see this in action, let's beef up your DT
a bit and run a benchmark:
Results:
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
DT[,colSums(.SD),.SDcols=-c(A, B)]
orDT[,lapply(.SD, sum),.SDcols=-c(A, B)]
– Khashaaby
means that within eachA
xB
pairing, you sum the value of every other column inDT
. @Khashaa's comment is (a few of the ways ) how to sum over all columns exceptingA
andB
, not by group – MichaelChiricoby
means exclude I guess, right? and which one is faster?colSums
orlapply
? – KTYby
does not mean exclude. It's just that the value of .SD refers only to other columns whenby
is used (which is a strange rule, I think). RegardingcolSums
, don't use it, as mentioned at the bottom here: github.com/Rdatatable/data.table/wiki/Do%27s-and-Don%27ts – FrankcolSums
, however I'm curious which is faster if you're not doing it by group, but rather over the whole table. – MichaelChirico