I am not very clear about the use of .SD and by.

For instance, does the snippet below mean: 'change all the columns in DT to factor except A and B'? The data.table manual also says: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so are columns A and B excluded?

DT = DT[ , lapply(.SD, as.factor), by = .(A, B)]

However, I have also read that by is like GROUP BY in SQL when you do aggregation. For instance, if I would like to sum over all the columns except A and B (like summing each column in SQL), do I still use something similar? Or does the code below instead take the sum grouped by the values in columns A and B (i.e., SUM ... GROUP BY A, B in SQL)?

DT[ , lapply(.SD, sum), by = .(A, B)]

Then how do I do a simple column sum over all the columns except A and B?

DT[ , colSums(.SD), .SDcols = -c("A", "B")] or DT[ , lapply(.SD, sum), .SDcols = -c("A", "B")] – Khashaa

Using by means that within each AxB pairing, you sum the value of every other column in DT. @Khashaa's comment shows a few of the ways to sum over all columns except A and B, not by group. – MichaelChirico

@MichaelChirico, when I change the column type as in the first example, does by mean exclude, then? And which one is faster, colSums or lapply? – KTY

by does not mean exclude. It's just that the value of .SD refers only to the other columns when by is used (which is a strange rule, I think). Regarding colSums, don't use it, as mentioned at the bottom here: github.com/Rdatatable/data.table/wiki/Do%27s-and-Don%27ts – Frank

@Frank Nice reference on colSums; however, I'm curious which is faster if you're not doing it by group, but rather over the whole table. – MichaelChirico

1 Answer

Just to illustrate the comments above with an example, let's take

set.seed(10238)
# A and B are the "id" variables within which the
#   "data" variables C and D vary meaningfully
DT = data.table(
  A = rep(1:3, each = 5L), 
  B = rep(1:5, 3L),
  C = sample(15L),
  D = sample(15L)
)
DT
#     A B  C  D
#  1: 1 1 14 11
#  2: 1 2  3  8
#  3: 1 3 15  1
#  4: 1 4  1 14
#  5: 1 5  5  9
#  6: 2 1  7 13
#  7: 2 2  2 12
#  8: 2 3  8  6
#  9: 2 4  9 15
# 10: 2 5  4  3
# 11: 3 1  6  5
# 12: 3 2 12 10
# 13: 3 3 10  4
# 14: 3 4 13  7
# 15: 3 5 11  2

Compare the following:

#Sum all columns
DT[ , lapply(.SD, sum)]
#     A  B   C   D
# 1: 30 45 120 120

#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
#    A  B  C  D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28

#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
#     B   C   D
# 1: 45 120 120

#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
#    B  C  D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14

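To see directly that .SD excludes the grouping columns (the point quoted from the manual in the question), you can inspect names(.SD) inside a grouped call; the cols name below is just for illustration:

# Within each group, .SD contains only the non-grouping columns
DT[ , .(cols = paste(names(.SD), collapse = ", ")), by = A]
#    A    cols
# 1: 1 B, C, D
# 2: 2 B, C, D
# 3: 3 B, C, D
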
A few notes:

  • You said "does the below snippet... change all the columns in DT..."

The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
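
A quick way to convince yourself of this, using the example DT above (out is just an illustrative name):

out = DT[ , lapply(.SD, as.factor), by = .(A, B)]
sapply(DT, class)   # all four columns of DT are still integer
sapply(out, class)  # C and D are factor only in the returned table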

  • You mentioned wanting to change the column types

Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (inefficient) way to do this, as is common with data.frames in base R, is to overwrite the old data.table with the new one you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].

This is wasteful because it creates copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
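
For the original goal here (convert every column except A and B to factor), a minimal by-reference sketch along those lines, using the example DT from above:

cols = setdiff(names(DT), c("A", "B"))  # the non-grouping columns: C and D
# Wrapping cols in parentheses tells := to use its value as column names
DT[ , (cols) := lapply(.SD, as.factor), .SDcols = cols]
sapply(DT, class)  # C and D are now factor; A and B untouched; no copy made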

  • You mentioned comparing the efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can see this for yourself in the output when you add the verbose = TRUE argument within []); to see it in action, let's beef up your DT a bit and run a benchmark:

library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]

library(microbenchmark)
microbenchmark(
  times = 100L,
  colsums = colSums(DT[ , !c("A", "B")]),
  lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)

Results:

# Unit: milliseconds
#     expr       min        lq      mean    median        uq       max neval
#  colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962   100
#  lapplys  246.5824  250.3753  252.9603  252.1586  254.8297  266.1771   100
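
Finally, to confirm the internal optimization mentioned above, add verbose = TRUE to the grouped call; a sketch using the benchmark table (the exact console wording varies by data.table version, but recent versions report the optimization, known as GForce):

# verbose = TRUE prints how j was optimized, e.g. lines like
#   lapply optimization changed j from 'lapply(.SD, sum)' to 'list(sum(V1), ...)'
#   GForce optimized j to 'list(gsum(V1), ...)'
invisible(DT[ , lapply(.SD, sum), by = A, .SDcols = !c("A", "B"), verbose = TRUE])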