To complement Arun's answer, here's an example with a data set of similar size to the OP's (3.5M rows, 80K IDs) showing that keyed and unkeyed aggregation are not very different. So the speedup may be due to avoiding the $ operator.
library(data.table)
set.seed(10)
eg <- function(x) data.table(
  id = sample(8e4, x, replace=TRUE),
  timestamp = as.POSIXct(
    runif(x,
          min = ISOdatetime(2013,1,1,0,0,0) - 60*60*24*30,
          max = ISOdatetime(2013,1,1,0,0,0)),
    origin = "1970-01-01"))
df <- eg(3.5e6)
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
unkeyed = df[,min(timestamp),by=id][,table(weekdays(V1))]
,keyed = dfk[,min(timestamp),by=id][,table(weekdays(V1))]
,times=5
)
#Unit: seconds
# expr min lq median uq max
#1 keyed 7.330195 7.381879 7.476096 7.486394 7.690694
#2 unkeyed 7.882838 7.888880 7.924962 7.927297 7.931368
Edit from Matthew: Actually the above is almost entirely to do with the type POSIXct.
> system.time(dfk[,min(timestamp),by=id])
user system elapsed
8.71 0.02 8.72
> dfk[,timestamp:=as.double(timestamp)] # discard POSIXct type to demonstrate
> system.time(dfk[,min(timestamp),by=id])
user system elapsed
0.14 0.02 0.15 # that's more like normal data.table speed
Reverting to POSIXct and using Rprof shows that 97% of the time is spent inside min() for that type (i.e., nothing to do with data.table):
$by.total
total.time total.pct self.time self.pct
system.time 8.70 100.00 0.00 0.00
[.data.table 8.64 99.31 0.12 1.38
[ 8.64 99.31 0.00 0.00
min 8.46 97.24 0.46 5.29
Summary.POSIXct 8.00 91.95 0.86 9.89
do.call 5.86 67.36 0.26 2.99
check_tzones 5.46 62.76 0.20 2.30
unique 5.26 60.46 2.04 23.45
sapply 3.74 42.99 0.46 5.29
simplify2array 2.38 27.36 0.16 1.84
NextMethod 1.28 14.71 1.28 14.71
unique.default 1.10 12.64 0.92 10.57
lapply 1.10 12.64 0.76 8.74
unlist 0.60 6.90 0.28 3.22
FUN 0.24 2.76 0.24 2.76
match.fun 0.22 2.53 0.22 2.53
is.factor 0.18 2.07 0.18 2.07
parent.frame 0.14 1.61 0.14 1.61
gc 0.06 0.69 0.06 0.69
duplist 0.04 0.46 0.04 0.46
[.POSIXct 0.02 0.23 0.02 0.23
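For reference, a profile like the one above can be reproduced with Rprof/summaryRprof (the file name "prof.out" below is arbitrary):

```r
# Sketch: profile the grouped min() on the POSIXct column
Rprof("prof.out")                     # start profiling to a file
dfk[, min(timestamp), by=id]
Rprof(NULL)                           # stop profiling
summaryRprof("prof.out")$by.total     # ranks calls by total time, as shown above
```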
Notice the object size of dfk:
> object.size(dfk)
40.1 Mb
Nothing should take 7 seconds in data.table at this tiny size! The data needs to be 100 times larger (4GB), with a non-flawed j, before you can see the difference between keyed by and ad hoc by.
Edit from Blue Magister:
Taking Matthew Dowle's edit into account, there is a difference between the keyed and unkeyed commands once timestamp is stored as a double:
df <- eg(3.5e6)
df[,timestamp := as.double(timestamp)]
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
unkeyed = df[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
,keyed = dfk[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
,times=10
)
#Unit: milliseconds
# expr min lq median uq max
#1 keyed 340.3177 346.8308 348.7150 354.7337 358.1348
#2 unkeyed 886.1687 888.7061 901.1527 945.6190 1036.3326
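If POSIXct results are still wanted, the conversion back can also be done once on the small (80K-row) aggregated result rather than inside table(); a sketch along the same lines (mintime is an illustrative column name):

```r
# Aggregate on the double column, then restore POSIXct on the small result.
# Assumes timestamp has already been converted with as.double() as above.
res <- df[, .(mintime = min(timestamp)), by=id]
res[, mintime := as.POSIXct(mintime, origin="1970-01-01")]
table(weekdays(res$mintime))
```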