I am confused about why accessing a data.table by row index is slower than accessing a data.frame. Any suggestions for a faster way to access each row of a data.table sequentially in a loop?

library(data.table)

m = matrix(1L, nrow=100000, ncol=100)

DF = as.data.frame(m)
DT = as.data.table(m)

> identical(DF[100, ], DT[100, ])
[1] FALSE

> all(DF[100, ] == DT[100, ])
[1] TRUE

> system.time(for (i in 1:1000) DT[i,])
   user  system elapsed 
  5.440   0.000   5.451 

> system.time(for (i in 1:1000) DF[i,])
   user  system elapsed 
  2.757   0.000   2.784 
The simplest explanation is [.data.table does a lot more things than [.data.frame. - Arun
How may I iterate over the rows of the data.frame by row index faster, then? - user3147662
I've created a FR #5260 here. Thanks for reporting. It should be possible to gain more speed. - Arun
@user3147662, why don't you provide more information about the problem you are trying to solve by iterating through rows of a data.table? You can do amazing powerful things without explicit iteration. Also, you should probably do that as a separate question. - BrodieG
A nice starting point would be to edit your post to state clearly what your actual task is, with reproducible examples and showing your output. - Arun

1 Answer

A data.table query takes more arguments and does more work than a data.frame query, so the fixed per-call overhead of DT[...] is larger than that of DF[...]. This overhead adds up when you call it in a loop. data.table is intended to execute a large, complex operation a few times, rather than a small, trivial calculation many times. So let's reformulate your test:

> system.time(DT[seq_len(nrow(m)),])
   user  system elapsed 
   0.08    0.02    0.09 
> system.time(DF[seq_len(nrow(m)),])
   user  system elapsed 
   0.08    0.05    0.13 

Here, they are about the same. With only one DT call, the overhead is paid once and is barely noticeable. In your case you paid it 1,000 times, once per loop iteration (unnecessarily, I might add). If you are using data.table and calling it thousands of times, you are probably using it wrong. There is almost certainly a way to reformulate the problem so that one or a few data.table calls do the same thing.

Also, note that even my reformulated test here is pretty trivial, which is why data.table performs comparably to data.frame.
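
To make the "one call instead of many" idea concrete, here is a minimal sketch. The task (summing each of the first 1,000 rows) is purely hypothetical, since the question does not say what is done with each row; the names n, loop_sums, vec_sums and M are mine. It only illustrates the pattern and is not a benchmark.

```r
library(data.table)

# Same shape of data as in the question.
m  <- matrix(1L, nrow = 100000, ncol = 100)
DT <- as.data.table(m)
n  <- 1000L  # number of rows to process (hypothetical)

# Row by row: one `[.data.table` call per iteration,
# so the per-call overhead is paid n times.
loop_sums <- integer(n)
for (i in seq_len(n)) loop_sums[i] <- sum(unlist(DT[i, ]))

# Reformulated: subset the rows once, then use a vectorized function.
vec_sums <- rowSums(DT[seq_len(n), ])

# If per-row access really is needed and all columns share a type,
# converting to a matrix once makes each M[i, ] lookup cheap.
M <- as.matrix(DT[seq_len(n), ])
first_row <- M[1, ]

# Both approaches agree on the result.
stopifnot(all(loop_sums == vec_sums))
```

Whether rowSums, a grouped data.table query, or the matrix fallback fits best depends on the real per-row computation, which is why a description of the actual task would help.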