I recently discovered the data.table package and am now wondering whether I should replace some of my plyr code. To summarize, I really like plyr and have achieved basically everything I wanted with it. However, my code takes a while to run, and the prospect of speeding things up was enough for me to run some tests. Those tests ended quite soon, and here is the reason.
What I do quite often with plyr is to split my data by a column containing dates and do some calculations:
library(plyr)
DF <- data.frame(Date=rep(c(Sys.time(), Sys.time() + 60), each=6), y=c(rnorm(6, 1), rnorm(6, -1)))
#Split up data and apply arbitrary function
ddply(DF, .(Date), function(df){mean(df$y) - df[nrow(df), "y"]})
However, using a column with the Date-format does not seem to work in data.table:
library(data.table)
DT <- data.table(Date=rep(c(Sys.time(), Sys.time() + 60), each=6), y=c(rnorm(6, 1), rnorm(6, -1)))
setkey(DT, Date)
#Error in setkey(DT, Date) : Column 'Date' cannot be auto converted to integer without losing information.
If I understand the package correctly, I only get substantial speed-ups when I use setkey(). Also, I don't think it would be good practice to constantly convert between Date and numeric. So am I missing something, or is there just no easy way to achieve this with data.table?
sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.6.3 zoo_1.7-2 lubridate_0.2.5 ggplot2_0.8.9 proto_0.3-9.2 reshape_0.8.4
[7] reshape2_1.1 xtable_1.5-6 plyr_1.5.2
loaded via a namespace (and not attached):
[1] digest_0.5.0 lattice_0.19-30 stringr_0.5 tools_2.13.1
Sys.time() returns a POSIXct datetime value, not a Date. In particular, the value returned (the number of seconds elapsed since 1/1/1970) is in general not an integral value, so converting to an integer will indeed lose information as the error message says – Hong Ooi
setkey – Andrie
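The point made in the comment above, that a POSIXct value is a (generally fractional) count of seconds since 1970-01-01, can be checked in base R. The following is only a minimal sketch: the fixed timestamp, the 0.25 s offset, and the column names Day and Sec are assumptions chosen for illustration, not part of the original question, and it does not claim to be the recommended data.table approach.

```r
# Hypothetical fixed timestamp with a fractional second, so the result is reproducible
t <- as.POSIXct("2011-07-08 12:00:00", tz = "UTC") + 0.25

class(t)                # a POSIXct date-time, not a Date
secs <- as.numeric(t)   # seconds elapsed since 1970-01-01 UTC
secs == floor(secs)     # FALSE here: truncating to integer would drop 0.25 s

# Same shape as the data in the question, built from the fixed timestamp
DF <- data.frame(Date = rep(c(t, t + 60), each = 6),
                 y = c(rnorm(6, 1), rnorm(6, -1)))

# Two explicit, lossy-by-choice grouping columns (names are illustrative):
DF$Day <- as.Date(DF$Date, tz = "UTC")            # calendar day, time of day dropped
DF$Sec <- as.integer(floor(as.numeric(DF$Date)))  # whole seconds, sub-second part dropped
```

Here Sec takes two distinct values (one per minute), so it splits the rows the same way the Date column does in the plyr example, while Day collapses both groups into one. Whether either column is an acceptable grouping key depends on how much time resolution the analysis actually needs.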