9 votes

I’ve devised a solution that looks up values from multiple columns of two separate data tables and adds a new column based on calculations of their values (multiple conditional comparisons). Code below. It uses a data.table join while calculating values from both tables; however, the tables aren’t joined on the columns I’m comparing, so I suspect I may not be getting the speed advantages inherent to data.tables that I’ve read so much about and am excited to tap into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”

The exercise is, given an X by X grid dtGrid and a list of X^2 random Events dtEvents within the grid, to determine how many Events occur within a 1-unit radius of each grid point. The code is below. I picked a grid size of 100 x 100, which takes ~1.5 sec to run the join on my machine, but I can’t go much bigger without introducing an enormous performance hit (200 x 200 takes ~22 sec).

I really like the flexibility of being able to add multiple conditions to my val statement (e.g., I could add a bunch of AND and OR combinations if I wanted to), so I'd like to retain that functionality.

Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to achieve a much speedier and more efficient outcome?

Thanks so much!

#Initialization stuff
library(data.table)
set.seed(77L)

#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L

#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)

#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)

#Return # of events within 1 unit radius of each grid point
system.time(
    dtEventsWithinRadius <- dtEvents[dtGrid, {
        val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];  # basic circle formula: x^2 + y^2 = radius^2
        list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
    }, by=.EACHI]
)
ColoradoGranite (comment): @Frank Guilty as charged. You're absolutely right, it should say an X+1 by X+1 grid ... I wanted all the Event points to fit within the grid, so I had to include the 0 X and Y grid points. That said, the problem I'm trying to solve is minimally affected by this change ... the # of events and the grid size are somewhat arbitrary, other than that they're both sizable. Thanks for the correction.
Frank (comment): For what it's worth, if you are able to switch the criterion from a unit circle to +/-1 on each dimension independently, it can be a lot faster (see the formatted version below): system.time(dtEvents[, { L = lapply(.SD, function(x) rep(as.integer(floor(x)), each=4L)); .( x = L[[1]] + 0:1, y = L[[2]] + rep(0:1, each=2L) ) }, .SDcols = x:y][, .N, by=x:y]). That's what I thought you were doing initially.
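For readability, here is the one-liner from Frank's comment reflowed as a block (same code, just reformatted):

#Frank's +/-1 box variant: for each event, enumerate the four grid points
#of the unit cell containing it, then count events per grid point
system.time(
    dtEvents[, {
        L = lapply(.SD, function(x) rep(as.integer(floor(x)), each=4L))
        .(x = L[[1]] + 0:1, y = L[[2]] + rep(0:1, each=2L))
    }, .SDcols = x:y][, .N, by=x:y]
)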

1 Answer

13 votes

Very interesting question, and a great use of by = .EACHI! Here's another approach, using the new non-equi joins from the current development version, v1.9.7.

Issue: Your use of by=.EACHI is completely justified, because the alternative is to perform a cross join (each row of dtGrid joined to all rows of dtEvents), and that's too exhaustive and bound to explode very quickly.
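For reference, a minimal sketch of that exhaustive alternative (this is essentially the materialised variant mentioned in the PS at the end; it reuses the question's dtGrid/dtEvents, and the problem is memory, not correctness):

#Materialise the full dummy-key cross join, then filter -- memory-hungry:
#at grid size 100 this builds ~101^2 * 100^2 = ~1e8 intermediate rows
full <- dtEvents[dtGrid, allow.cartesian=TRUE]
ansX <- full[(x - i.x)^2 + (y - i.y)^2 < 1,
             .(EventsWithinRadius = .N), by=.(x=i.x, y=i.y)]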

However, by = .EACHI here is performed along with an equi-join on a dummy column, which still results in computing all pairwise distances (except that it does one grid point at a time, and is therefore memory efficient). That is, in your code, for each row of dtGrid, the distance to every row of dtEvents is still computed; hence it doesn't scale as well as expected.

Strategy: Then you'd agree that an acceptable improvement is to restrict the number of rows that would result from joining each row of dtGrid to dtEvents.

Let (x_i, y_i) come from dtGrid and (a_j, b_j) come from dtEvents, where 1 <= i <= nrow(dtGrid) and 1 <= j <= nrow(dtEvents). Then, for i = 1, we need to extract all j that satisfy (x_1 - a_j)^2 + (y_1 - b_j)^2 < 1. That can only happen when:

(x_1 - a_j)^2 < 1 AND (y_1 - b_j)^2 < 1

This reduces the search space drastically because, instead of looking at all rows of dtEvents for each row of dtGrid, we just have to extract those rows where:

a_j - 1 <= x_1 <= a_j + 1 AND b_j - 1 <= y_1 <= b_j + 1
# where '1' is the radius
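As a quick numeric sanity check of this implication (a small standalone sketch, not part of the solution), one can verify that the box condition never excludes a point that the circle condition keeps:

#Sanity check: the +/-1 box condition is implied by the circle condition
set.seed(1L)
px <- runif(1e5, -2, 2); py <- runif(1e5, -2, 2)  #offsets (a_j - x_1, b_j - y_1)
in_circle <- px^2 + py^2 < 1
in_box    <- abs(px) < 1 & abs(py) < 1
all(in_box[in_circle])  #TRUE: every in-circle point is also in the box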

This constraint can be directly translated into a non-equi join, combined with by = .EACHI as before. The only additional step required is to construct the columns a_j - 1, a_j + 1, b_j - 1, b_j + 1, as follows:

foo1 <- function(dt1, dt2) {
    dt2[, `:=`(xm = x - 1, xp = x + 1, ym = y - 1, yp = y + 1)]   ## (1)
    tmp = dt2[dt1, on = .(xm <= x, xp >= x, ym <= y, yp >= y),
              .(sum((i.x - x)^2 + (i.y - y)^2 < 1)), by = .EACHI,
              allow.cartesian = TRUE, nomatch = 0L
          ][, c("xp", "yp") := NULL]                              ## (2)
    tmp[]
}

## (1) constructs all the columns necessary for the non-equi join (since expressions are not allowed in on= yet).

## (2) performs the non-equi join, which computes distances and counts those that are < 1 on the restricted set of candidate events for each row of dtGrid -- hence it should be much faster.
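A note on the result's column names: in a non-equi join, the join columns in the output are named after dt2's columns (xm, xp, ym, yp) but hold dtGrid's values, which is why dropping xp and yp leaves xm/ym as the grid coordinates. Usage is then simply:

dtEventsWithinRadius <- foo1(dtGrid, dtEvents)
head(dtEventsWithinRadius)  #columns: xm (grid x), ym (grid y), V1 (count)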

Benchmarks:

# Here's your code (modified to ensure identical column names etc.):
foo2 <- function(dt1, dt2) {
    ans = dt2[dt1, 
                {
                 val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2];
                 .(xm=i.x, ym=i.y, V1=sum(val))
                }, 
            by=.EACHI][, "DummyJoin" := NULL]
    ans[]
}

# on grid size of 100:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.166s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 1.626s

# on grid size of 200:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 0.983s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 31.038s

# on grid size of 300:
system.time(ans1 <- foo1(dtGrid, dtEvents)) # 2.847s
system.time(ans2 <- foo2(dtGrid, dtEvents)) # 151.32s

identical(ans1[V1 != 0L], ans2[V1 != 0L]) # TRUE for all of them

The speedups are ~10x, 32x and 53x respectively.

Note that the rows of dtGrid for which the condition is not satisfied by even a single row of dtEvents will not be present in the result (due to nomatch=0L). If you want those rows, you'll have to also keep one of the xm/xp/ym/yp columns and check it for NA (= no matches).

This is the reason we had to remove all 0 counts from both results to get identical() to return TRUE.
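If you do need the zero-count grid points, here is a minimal sketch of one way to keep them (a hypothetical foo1_keep_zeros variant, assuming that with the default nomatch=NA the sum() over an unmatched grid row evaluates to NA, which can then be zeroed):

foo1_keep_zeros <- function(dt1, dt2) {
    dt2[, `:=`(xm = x - 1, xp = x + 1, ym = y - 1, yp = y + 1)]
    tmp = dt2[dt1, on = .(xm <= x, xp >= x, ym <= y, yp >= y),
              .(V1 = sum((i.x - x)^2 + (i.y - y)^2 < 1)),
              by = .EACHI, allow.cartesian = TRUE]  #nomatch left at default NA
    tmp[is.na(V1), V1 := 0L]                        #unmatched grid points -> 0
    tmp[, c("xp", "yp") := NULL][]
}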

HTH

PS: See the revision history for another variation, in which the entire join is materialised first and the distances and counts are computed afterwards.