6 votes

Another novice question regarding big data. I'm working with a large dataset (3.5M rows) of time series data. I want to create a data.table with a column containing the first time each unique identifier appears.

df is a data.table, df$timestamp is a date-time of class POSIXct, and df$id is the unique numeric identifier. I'm using the following code:

# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by = list(id)]  # first timestamp for each unique ID

Here's the catch: I'm aggregating over 80k unique IDs, and R is choking. Is there anything I can do to optimize my approach?

On which command is R slow? I can't tell from your description. – Blue Magister
Is your data.table keyed? If not, that will help significantly. – Justin
On another note: you don't need to say df$timestamp in j. Just min(timestamp) should be sufficient. – Arun
Why not use !duplicated(df$id) (only if your data set is ordered by date)? – Rcoster
@Arun Keying the data.table, apparently. – Blue Magister

4 Answers

6 votes

Here's a little benchmark to test which behaviour takes a lot of time:

require(data.table)
dt <- data.table(sample(seq(as.Date("2012-01-01"), as.Date("2013-12-31"), 
          by="days"), 1e5, replace=T), val=sample(1e4, 1e5, replace = T))

FUN1 <- function() {
    out <- dt[, min(dt$V1), by=val]  # min of entire V1 for each group i.e. wrong
}

FUN2 <- function() {
    out <- dt[, min(V1), by=val]     # min of V1 within group as intended
}

require(rbenchmark)
> benchmark(FUN1(), FUN2(), replications = 1, order="elapsed")
#     test replications elapsed relative user.self sys.self user.child sys.child
# 2 FUN2()            1   0.271    1.000     0.242    0.002          0         0
# 1 FUN1()            1  38.378  141.616    32.584    4.153          0         0

It is very clear that FUN2() is blazing fast. Remember, in both cases, no key was set.
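
A quick sanity check (not part of the original answer) also shows why FUN1() is not just slow but wrong: dt$V1 inside j refers to the entire column, so every group ends up with the global minimum.

res1 <- dt[, min(dt$V1), by = val]   # dt$V1 ignores the grouping
res2 <- dt[, min(V1), by = val]      # V1 is restricted to the current group
length(unique(res1$V1))              # 1 -- the same global minimum repeated for every group
length(unique(res2$V1))              # many distinct per-group minima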

5 votes

As mentioned by @Arun, the real key (no pun intended) is the use of proper data.table syntax rather than setkey.

df[, min(timestamp), by=id]

While 80k unique IDs sounds like a lot, using the key feature of data.table can make it a manageable prospect.

setkey(df, id)

Then process as before. For what it's worth, you can often take advantage of a pleasant side effect of keys: sorting.

set.seed(1)
dat <- data.table(x = sample(1:10, 10), y = c('a', 'b'))
dat

    x y
 1:  3 a
 2:  4 b
 3:  5 a
 4:  7 b
 5:  2 a
 6:  8 b
 7:  9 a
 8:  6 b
 9: 10 a
10:  1 b

setkey(dat, y, x)
dat

     x y
 1:  2 a
 2:  3 a
 3:  5 a
 4:  9 a
 5: 10 a
 6:  1 b
 7:  4 b
 8:  6 b
 9:  7 b
10:  8 b

Then the min, or another more complex function, is just a subset operation:

dat[, .SD[1], by=y]
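
As a small cross-check (not from the original answer), the subset form agrees with computing the minimum explicitly, because the key (y, x) has already sorted x within each y group:

dat[, .SD[1], by = y]          # first row per group
dat[, .(x = min(x)), by = y]   # per-group minimum computed explicitly
#    y x
# 1: a 2
# 2: b 1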
4 votes

To complement Arun's answer, here's a test with a data set of similar size to the OP's (3.5M rows, 80K IDs), showing that keyed and unkeyed aggregation are not too different. So the speedup may be due to avoiding the $ operator.

set.seed(10)
eg <- function(x) data.table(
    id = sample(8e4, x, replace = TRUE),
    timestamp = as.POSIXct(runif(x,
                                 min = ISOdatetime(2013, 1, 1, 0, 0, 0) - 60*60*24*30,
                                 max = ISOdatetime(2013, 1, 1, 0, 0, 0)),
                           origin = "1970-01-01"))
df <- eg(3.5e6)
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
    unkeyed = df[,min(timestamp),by=id][,table(weekdays(V1))]
    ,keyed = dfk[,min(timestamp),by=id][,table(weekdays(V1))]
    ,times=5
)
#Unit: seconds
#     expr      min       lq   median       uq      max
#1   keyed 7.330195 7.381879 7.476096 7.486394 7.690694
#2 unkeyed 7.882838 7.888880 7.924962 7.927297 7.931368

Edit from Matthew.

Actually, the above is almost entirely due to the POSIXct type.

> system.time(dfk[,min(timestamp),by=id])
   user  system elapsed 
   8.71    0.02    8.72 
> dfk[,timestamp:=as.double(timestamp)]  # discard POSIXct type to demonstrate
> system.time(dfk[,min(timestamp),by=id])
   user  system elapsed 
   0.14    0.02    0.15     # that's more like normal data.table speed

Reverting to POSIXct and using Rprof shows that 97% of the time is spent inside min() for that type (i.e. nothing to do with data.table):

$by.total
                total.time total.pct self.time self.pct
system.time           8.70    100.00      0.00     0.00
[.data.table          8.64     99.31      0.12     1.38
[                     8.64     99.31      0.00     0.00
min                   8.46     97.24      0.46     5.29
Summary.POSIXct       8.00     91.95      0.86     9.89
do.call               5.86     67.36      0.26     2.99
check_tzones          5.46     62.76      0.20     2.30
unique                5.26     60.46      2.04    23.45
sapply                3.74     42.99      0.46     5.29
simplify2array        2.38     27.36      0.16     1.84
NextMethod            1.28     14.71      1.28    14.71
unique.default        1.10     12.64      0.92    10.57
lapply                1.10     12.64      0.76     8.74
unlist                0.60      6.90      0.28     3.22
FUN                   0.24      2.76      0.24     2.76
match.fun             0.22      2.53      0.22     2.53
is.factor             0.18      2.07      0.18     2.07
parent.frame          0.14      1.61      0.14     1.61
gc                    0.06      0.69      0.06     0.69
duplist               0.04      0.46      0.04     0.46
[.POSIXct             0.02      0.23      0.02     0.23
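
For reference, a profile like the one above can be captured roughly as follows (a sketch; the exact call used in this edit isn't shown, and it assumes dfk$timestamp has been converted back to POSIXct):

Rprof("min_posixct.out")                      # start profiling to a file
system.time(dfk[, min(timestamp), by = id])   # the slow POSIXct aggregation
Rprof(NULL)                                   # stop profiling
summaryRprof("min_posixct.out")$by.total      # produces a table like the one above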

Notice the object size of dfk:

> object.size(dfk)
40.1 Mb

Nothing should take 7 seconds in data.table at this tiny size! The data needs to be 100 times larger (4GB), with a j that isn't flawed, before you can see the difference between keyed by and ad hoc by.
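
For anyone who wants to see that difference, here is a sketch of how such a comparison could be set up (the 100x size is illustrative, needs several GB of RAM, and no timings are claimed here):

big <- eg(3.5e8)                          # ~100x the OP's size (illustrative only)
big[, timestamp := as.double(timestamp)]  # keep min() fast, as demonstrated above
bigk <- copy(big)
setkey(bigk, id)

system.time(big[, min(timestamp), by = id])    # ad hoc by
system.time(bigk[, min(timestamp), by = id])   # keyed by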

Edit from Blue Magister:

Taking Matthew Dowle's edit into account, there is indeed a difference between the keyed and unkeyed commands once timestamp is stored as a double.

df <- eg(3.5e6)
df[,timestamp := as.double(timestamp)]
dfk <- copy(df)
setkey(dfk,id)
require(microbenchmark)
microbenchmark(
    unkeyed = df[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
    ,keyed = dfk[,min(timestamp),by=id][,table(weekdays(as.POSIXct(V1,origin="1970-01-01")))]
    ,times=10
)
#Unit: milliseconds
#     expr      min       lq   median       uq       max
#1   keyed 340.3177 346.8308 348.7150 354.7337  358.1348
#2 unkeyed 886.1687 888.7061 901.1527 945.6190 1036.3326
2 votes

Here's a data.table version

dtBy <- function(dt)
    dt[, min(timestamp), by=id]

Going a little old-school, here's a function that returns the minimum of each group

minBy <- function(x, by) {
    o <- order(x)
    by <- by[o]
    idx <- !duplicated(by)
    data.frame(by=by[idx], x=x[o][idx])
}

and seems to have reasonable performance for Blue Magister's sample data:

> system.time(res0 <- dtBy(dt))
   user  system elapsed 
 11.165   0.216  11.894 
> system.time(res1 <- minBy(dt$timestamp, dt$id))
   user  system elapsed 
  4.784   0.036   4.836 
> all.equal(res0[order(res0$id),], res1[order(res1$by),],
+           check.attributes=FALSE)
[1] TRUE
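
As a small illustration of what minBy() returns (a toy example, not from the original answer):

x  <- c(5, 3, 9, 1, 7)
by <- c("a", "b", "a", "b", "a")
minBy(x, by)
#   by x
# 1  b 1
# 2  a 5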

Edit from Matthew (mostly)

Yes, this is because minBy avoids calling min() on the POSIXct type, which is the terribly slow part being repeated. It has nothing to do with data.table.

Using the dfk from Blue Magister's answer:

dfk[,timestamp:=as.double(timestamp)]  # discard POSIXct type to demonstrate

system.time(res0 <- dtBy(dfk))
   user  system elapsed 
   0.16    0.02    0.17 
system.time(res1 <- minBy(dfk$timestamp, dfk$id))
   user  system elapsed 
   4.87    0.04    4.92 

Now the old-school method looks very slow in comparison to data.table. All the time was being spent in min() on the POSIXct type. See the edit on Blue Magister's answer above for the Rprof output.