7
votes

I have two data tables with x,y coordinates and some other info which I would like to merge based on nearest neighbour distance, i.e. on the minimum in squared difference of both x and y (dx_i =min ([(x_i-x_j)^2+(y_i-y_j)^2]^0.5). Say I have the following two sets:

DT1=data.table(x=1:5,y=3:7)    
DT2=data.table(x=c(2,4,2,3,6),y=c(2.5,3.1,2,3,5),Q=c('a','b','c','d','e'))

Then the desired result of the merge would be:

   x y Q
1: 1 3 a
2: 2 4 d
3: 3 5 d
4: 4 6 e
5: 5 7 e

I could of course write a loop over DT1 to calculate the nearest neighbour for each row in DT1 and then merge based on this calculation, but that seems to defeat the purpose of data tables. Moreover, that will be very slow for data tables of several million rows.

I know that for a single column I could do a nearest neighbour merge like this

DT2[DT1,roll="nearest"]

But that (logically) doesn't work when I define 2 keys (x and y) for the tables to be merged. Does a similar syntax for a 2-parameter nearest neighbour merge exist? If not, is there a smarter way to do this then just looping, like I mentioned?

1
The reason it's only done for 1 column, is because for 1 column you can sort it in both tables, making it easy to do a rolling merge. For 2 or more columns you can't do that (there is no sort order) and it's a whole new type of an operation.eddi
@eddi Yes, I understand that the operation will be of a different type. I was just wondering whether this (I think common) type of merge was implemented in some smart wayMichiel
I haven't thought too much about it - but if you come up with a smart algorithm to do it, you can certainly add a suggestion or a pull request on github.eddi

1 Answers

4
votes

One possible solution:

func = function(u,v)
{
    vec = with(DT2, (u-x)^2 + (v-y)^2)
    DT2[which.min(vec),]$Q
}

transform(DT1, Q=apply(DT1, 1, function(u) func(u[1], u[2])))

#   x y Q
#1: 1 3 a
#2: 2 4 d
#3: 3 5 d
#4: 4 6 e
#5: 5 7 e