Can someone please explain the following output? Unless I am missing something (which I probably am), it seems that the speed of subsetting a data.table depends on the specific values stored in one of its columns, even when the columns are of the same class and have no apparent differences other than their values.
How is this possible?
> dim(otherTest)
[1] 3572069 2
> dim(test)
[1] 3572069 2
> length(unique(test$keys))
[1] 28741
> length(unique(otherTest$keys))
[1] 28742
> sapply(test,class)
thingy keys
"character" "character"
> sapply(otherTest,class)
thingy keys
"character" "character"
> class(test)
[1] "data.table" "data.frame"
> class(otherTest)
[1] "data.table" "data.frame"
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.5438871 secs
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 42.78009 secs
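(For what it's worth, here is a minimal timing sketch that could be used to average out the Sys.time() noise; it assumes test, otherTest, and partition are the objects above.)

# A minimal timing sketch, assuming test, otherTest, and partition are
# the objects shown above; system.time() over a few repetitions is less
# noisy than differencing Sys.time() around a single run.
timeSubset = function(dt, reps = 3) {
  system.time(for (i in seq_len(reps)) dt[keys %in% partition])
}
timeSubset(test)       # the slow case above
timeSubset(otherTest)  # the fast case above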
Summary EDIT: The difference in speed is not due to differently sized data.tables, nor to different numbers of unique values. As you can see in my revised example above, even after generating the keys so that both tables have the same number of unique values (values that fall in the same general range and share at least one value, but are in general different), I am getting the same performance difference.
Regarding sharing the data: I unfortunately can't share the test table, but I could share otherTest. The whole idea is that I was trying to replicate the test table as closely as possible (same size, same classes/types, same keys, same number of NA values, etc.) so that I could post it to SO -- but then, strangely, my made-up data.table behaved very differently, and I can't figure out why!
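(For concreteness, here is a hypothetical sketch of the kind of replica I was building -- the key strings below are invented; only the dimensions, classes, and unique-key count match the real table.)

library(data.table)

# Hypothetical generator for a shareable stand-in table: 3572069 rows,
# two character columns, ~28742 unique keys. The actual key values are
# made up; only the shape matches the real test table.
set.seed(1)
n = 3572069
keyPool = as.character(sample(1e6, 28742))
madeUpTest = data.table(thingy = as.character(sample(1e6, n, replace = TRUE)),
                        keys   = sample(keyPool, n, replace = TRUE))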
Also, I will add that the only reason I suspected the problem was coming from data.table is that a couple of weeks ago I ran into a similar problem with subsetting a data.table that turned out to be an actual bug in the new data.table release (I ended up deleting the question because it was a duplicate). The bug also involved using the %in% function to subset a data.table: if there were duplicate entries in the right argument of %in%, it returned duplicated output. So if x = c(1,2,3) and y = c(1,1,2,2), x %in% y would return a vector of length 8. I have reinstalled the data.table package, so I don't think it could be the same bug -- but perhaps it is related?
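(For reference, the base-R semantics that the bug violated: %in% always returns a logical vector with the length of its left argument, regardless of duplicates on the right.)

# Expected base-R behavior, using the same toy example as above:
x = c(1, 2, 3)
y = c(1, 1, 2, 2)
x %in% y
# [1]  TRUE  TRUE FALSE   -- always length(x); duplicates in y don't matter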
EDIT (re Dean MacGregor's comment)
> sapply(test,class)
thingy keys
"character" "character"
> sapply(otherTest,class)
thingy keys
"character" "character"
# benchmarking the original test table
> test2 = data.table(sapply(test, as.numeric))
> otherTest2 = data.table(sapply(otherTest, as.numeric))
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 52.68567 secs
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.3503151 secs
# benchmarking after converting to numeric
> partition = as.numeric(partition)
> start = Sys.time()
> newTest = otherTest2[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.7240109 secs
> start = Sys.time()
> newTest = test2[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 42.18522 secs
# benchmarking again after converting back to character
> partition = as.character(partition)
> otherTest2 = data.table(sapply(otherTest2, as.character))
> test2 = data.table(sapply(test2, as.character))
> start = Sys.time()
> newTest = test2[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 48.39109 secs
> start = Sys.time()
> newTest = data.table(otherTest2[keys%in%partition])
> end = Sys.time()
> print(end - start)
Time difference of 0.1846113 secs
So the slowdown does not depend on class.
EDIT: The problem is clearly coming from data.table: if I convert to a matrix the problem disappears, and if I convert back to a data.table the problem comes back.
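(A sketch of that round trip; both columns are character, so as.matrix() loses no type information.)

# Round trip described above: subsetting the matrix is fast, but after
# converting back to a data.table the slowdown returns.
testMat = as.matrix(test)
system.time(newTest <- testMat[testMat[, "keys"] %in% partition, ])  # fast

testBack = data.table(testMat)
system.time(newTest <- testBack[keys %in% partition])  # slow again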
EDIT: I noticed that the problem has to do with how data.table treats duplicates, which sounds about right because it is similar to the bug I found last week in data.table 1.9.4 that I described above.
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 39.19983 secs
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.3776946 secs
> sum(duplicated(test))/length(duplicated(test))
[1] 0.991954
> sum(duplicated(otherTest))/length(duplicated(otherTest))
[1] 0.9918879
> otherTest[duplicated(otherTest)] = NA
> test[duplicated(test)] = NA
> start = Sys.time()
> newTest = otherTest[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.2272599 secs
> start = Sys.time()
> newTest = test[keys%in%partition]
> end = Sys.time()
> print(end - start)
Time difference of 0.2041721 secs
So even though they have essentially the same proportion of duplicates, the two data.tables (or, more specifically, the %in% function inside the data.table subset) are clearly treating the duplicates differently. Another interesting observation related to duplicates is this (note that I am starting from the original tables again here):
> start = Sys.time()
> newTest = test[keys%in%unique(partition)]
> end = Sys.time()
> print(end - start)
Time difference of 0.6649222 secs
> start = Sys.time()
> newTest = otherTest[keys%in%unique(partition)]
> end = Sys.time()
> print(end - start)
Time difference of 0.205637 secs
So removing duplicates from the right argument to %in% also fixes the problem.
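(Which at least suggests a workaround until the cause is clear: deduplicate the right-hand side once up front.)

# Workaround implied by the observation above: dedupe the lookup vector
partitionU = unique(partition)
newTest = test[keys %in% partitionU]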
So given this new info, the question still remains: why are these two data.tables treating duplicated values differently?
