I’ve devised a solution to look up values from multiple columns of two separate data tables and add a new column based on calculations of their values (multiple conditional comparisons). It uses a data.table join and calculates values from both tables. However, the tables aren’t joined on the columns I’m comparing, so I suspect I’m not getting the speed advantages inherent to data.tables that I’ve read so much about and am excited to tap into. Said another way, I’m joining on a ‘dummy’ column, so I don’t think I’m joining “properly.”
The exercise is, given an X by X grid (dtGrid) and a list of X^2 random Events (dtEvents) within the grid, to determine how many Events occur within a 1 unit radius of each grid point. The code is below. I picked a grid size of 100 X 100, which takes ~1.5 sec to run the join on my machine. But I can’t go much bigger without a large performance hit (200 X 200 takes ~22 sec).
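As a rough sanity check on those timings, the dummy-key join pairs every grid point with every event, so the number of distance comparisons is the product of the two row counts. The helper name pair_count below is hypothetical and only illustrates the arithmetic; it assumes the grid is built with seq(0, X, 1), which gives X + 1 values per axis.

```r
# Hypothetical helper: number of (grid point, event) pairs the dummy join evaluates
pair_count <- function(grid_size) {
  n_grid_points <- (grid_size + 1)^2  # seq(0, grid_size, 1) has grid_size + 1 values per axis
  n_events <- grid_size^2             # X^2 random events
  n_grid_points * n_events
}

pair_count(100)                    # about 1.0e8 comparisons
pair_count(200)                    # about 1.6e9 comparisons
pair_count(200) / pair_count(100)  # roughly 15.8
```

Doubling the grid size multiplies the work by roughly 16, which is in line with the jump from ~1.5 sec to ~22 sec.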
I really like the flexibility of being able to add multiple conditions to my val statement (e.g., if I wanted to add a bunch of AND and OR combinations I could do that), so I'd like to retain that functionality.
Is there a way to use data.table joins ‘properly’ (or any other data.table solution) to get a much faster, more efficient result?
Thanks so much!
#Initialization stuff
library(data.table)
set.seed(77L)
#Set grid size constant
#Increasing this number to a value much larger than 100 will result in significantly longer run times
cstGridSize = 100L
#Create Grid
vecXYSquare <- seq(0, cstGridSize, 1)
dtGrid <- data.table(expand.grid(vecXYSquare, vecXYSquare))
setnames(dtGrid, 'Var1', 'x')
setnames(dtGrid, 'Var2', 'y')
dtGrid[, DummyJoin:='A']
setkey(dtGrid, DummyJoin)
#Create Events
xrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
yrand <- runif(cstGridSize^2, 0, cstGridSize + 1)
dtEvents <- data.table(x=xrand, y=yrand)
dtEvents[, DummyJoin:='A']
dtEvents[, Counter:=1L]
setkey(dtEvents, DummyJoin)
#Return # of events within 1 unit radius of each grid point
system.time(
  dtEventsWithinRadius <- dtEvents[dtGrid, {
    #Events strictly inside the unit circle around the grid point: (x - i.x)^2 + (y - i.y)^2 < radius^2
    val = Counter[(x - i.x)^2 + (y - i.y)^2 < 1^2]
    list(col_i.x=i.x, col_i.y=i.y, EventsWithinRadius=sum(val))
  }, by=.EACHI]
)
system.time(dtEvents[, { L = lapply(.SD, function(x) rep(as.integer(floor(x)), each=4L)); .( x = L[[1]] + 0:1, y = L[[2]] + rep(0:1, each=2L) ) }, .SDcols = x:y][, .N, by=x:y]). That's what I thought you were doing initially. – Frank
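The cell-based idea in the comment above can be combined with the exact radius test from the question. What follows is only a minimal sketch under stated assumptions, not a definitive answer: it relies on the grid points being integers and the radius being exactly 1, so an event can only lie within range of the four integer corners of the cell it falls in. The names n, grid, events, ex, ey, cand, hits and counts are hypothetical, and a small grid is used so the result is easy to check against a brute-force count.

```r
library(data.table)
set.seed(77L)

n <- 20L  # hypothetical small grid size, for illustration only
grid <- CJ(x = 0:n, y = 0:n)  # integer grid points
events <- data.table(ex = runif(n^2, 0, n + 1), ey = runif(n^2, 0, n + 1))

# Repeat each event four times, once per corner of the unit cell containing it
cand <- events[rep(seq_len(nrow(events)), each = 4L)]
cand[, `:=`(
  x = as.integer(floor(ex)) + rep(0:1, length.out = .N),
  y = as.integer(floor(ey)) + rep(c(0L, 0L, 1L, 1L), length.out = .N)
)]

# Keep only (event, corner) pairs strictly inside the unit circle, then count per corner
hits <- cand[(ex - x)^2 + (ey - y)^2 < 1, .N, by = .(x, y)]

# Join the counts onto the grid so that points with no nearby events are kept
counts <- hits[grid, on = .(x, y)]
counts[is.na(N), N := 0L]
setnames(counts, "N", "EventsWithinRadius")
```

The filter expression in the hits step is an ordinary logical vector, so further AND / OR conditions could be added there much as in the original val statement. A larger radius or a non-integer grid would need more neighbouring cells per event than the four corners assumed here.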