1
votes

I have 2 sets of points, set1 and set2. Each point in both sets has a date associated with it. Points in set1 are "ephemeral" and exist only on the given date. Points in set2 are "permanent": they are constructed on a given date and exist forever after that date.

set.seed(1)
dates <- seq(as.Date('2011-01-01'),as.Date('2011-12-31'),by='days')

set1 <- data.frame(lat=40+runif(10000),
lon=-70+runif(10000),date=sample(dates,10000,replace=TRUE))

set2 <- data.frame(lat=40+runif(100),
lon=-70+runif(100),date=sample(dates,100,replace=TRUE))

Here's my problem: for each point in set1 (ephemeral), find the distance to the closest point in set2 (permanent) that was constructed BEFORE the event in set1 occurred. For example, the 1st point in set1 occurred on 2011-03-18:

> set1[1,]
       lat       lon       date
1 40.26551 -69.93529 2011-03-18

So I want to find the closest point in set2 that was constructed before 2011-03-18:

> head(set2[set2$date<=as.Date('2011-03-18'),])
        lat       lon       date
1  40.41531 -69.25765 2011-02-18
7  40.24690 -69.29812 2011-02-19
13 40.10250 -69.52515 2011-02-12
14 40.53675 -69.28134 2011-02-27
17 40.66236 -69.07396 2011-02-17
20 40.67351 -69.88217 2011-01-04

The additional wrinkle is that these are latitude/longitude points, so I have to calculate distances along the surface of the earth. The R package fields provides a convenient function to do this:

require(fields)
distMatrix <- rdist.earth(set1[,c('lon','lat')], 
set2[,c('lon','lat')], miles = TRUE)

My question is: how can I set the distances in this matrix to Inf wherever the point in set2 (column of the distance matrix) was constructed after the point in set1 (row of the distance matrix)?

2 Answers

3
votes

Here is what I would do:

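# TRUE where the set1 event falls on or before the set2 construction date
earlierMatrix <- outer(set1$date, set2$date, "<=")
# Adding Inf to those cells rules them out; all other distances are unchanged
distMatrix2 <- distMatrix + ifelse(earlierMatrix, Inf, 0)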
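
To make the masking idea concrete, here is a minimal sketch on toy data. The dates, the distance values, and the names events, sites, d, and notYetBuilt are all made up for illustration; only base R is used. Note that the comparison operator decides how same-day points are treated: the strict "<" used here keeps a site built on the same day as the event, whereas "<=" would exclude it.

```r
# Toy data (made up): 3 events (rows) and 2 permanent sites (columns)
events <- as.Date(c("2011-03-01", "2011-01-15", "2011-06-30"))
sites  <- as.Date(c("2011-02-01", "2011-05-01"))
d <- matrix(c(5, 2,
              1, 9,
              4, 3), nrow = 3, byrow = TRUE)

# TRUE where the site did not exist yet when the event happened
notYetBuilt <- outer(events, sites, "<")

# Rule those pairs out by setting their distance to Inf
masked <- d
masked[notYetBuilt] <- Inf

# Row-wise nearest distance and the column index of the nearest site
nearestDist <- apply(masked, 1, min)
nearestSite <- apply(masked, 1, which.min)

# Rows with no eligible site have an Inf distance; report their index as NA
nearestSite[!is.finite(nearestDist)] <- NA
```

In this toy case the second event predates both sites, so its nearest distance is Inf and its index is NA.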
0
votes

Here's my attempt at an answer. It's not particularly efficient, but I think it is correct. It also allows you to easily sub in different distance calculators:

#Calculate distances 
require(fields)
distMatrix <- lapply(1:nrow(set1),function(x) {

    #Find distances to all points
    distances <- rdist.earth(set1[x,c('lon','lat')], set2[,c('lon','lat')], miles = TRUE)

    #Set distance to Inf if the set1 point occurred BEFORE the set2 dates
    distances <- ifelse(set1[x,'date']<set2[,'date'], Inf, distances)

    return(distances)
})
distMatrix  <- do.call(rbind,distMatrix)

#Find distance to closest object
set1$dist <- apply(distMatrix,1,min)

#Find id of closest object
objectID <- lapply(1:nrow(set1),function(x) {
    if (set1[x,'dist']<Inf) {
        IDs <- which(set1[x,'dist']==distMatrix[x,])
    } else {
        IDs <- NA
    }
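    #Randomly break ties (if there are any). Index into IDs, because
    #sample(IDs,1) would draw from 1:IDs when IDs is a single number
    return(IDs[sample.int(length(IDs),1)])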
})
set1$objectID <- do.call(rbind,objectID)

Here's the head of the resulting dataset:

> head(set1)
       lat       lon       date      dist objectID
1 40.26551 -69.93529 2011-03-18  3.215514       13
2 40.37212 -69.32339 2011-02-11 10.320910       46
3 40.57285 -69.26463 2011-02-23  3.954132        4
4 40.90821 -69.88870 2011-04-24  4.132536       49
5 40.20168 -69.95335 2011-02-24  4.284692       45
6 40.89839 -69.86909 2011-07-12  3.385769       57
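
One detail worth checking in the tie-breaking step: in R, sample(x, 1) with a single positive number x draws from 1:x rather than returning x, so a lone candidate ID can come back as a different, smaller ID. A minimal sketch of a safer helper follows; the name pickOne is hypothetical and not part of the answer above.

```r
# pickOne() is a hypothetical helper name used only for illustration.
# It returns one element of `ids` chosen at random, or NA if `ids` is empty.
pickOne <- function(ids) {
  if (length(ids) == 0) return(NA_integer_)
  # Sample a position, then index, so a single ID is returned unchanged
  ids[sample.int(length(ids), 1)]
}

single <- pickOne(57L)         # one candidate: returned as-is
tied   <- pickOne(c(4L, 9L))   # two tied candidates: one of them is returned
none   <- pickOne(integer(0))  # no candidates: NA
```

Sampling a position and then indexing avoids the special case entirely, so the same call works for one candidate, several tied candidates, or none.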