I have two dataframes: df1 contains observations with lat-lon coordinates; df2 has names with lat-lon coordinates. I want to create a new variable df1$names which has for each observation the names of df2 that are within a specified distance to that observation.

Some sample data for df1:

df1 <- structure(list(lat = c(52.768, 53.155, 53.238, 53.253, 53.312, 53.21, 53.21, 53.109, 53.376, 53.317, 52.972, 53.337, 53.208, 53.278, 53.316, 53.288, 53.341, 52.945, 53.317, 53.249), lon = c(6.873, 6.82, 6.81, 6.82, 6.84, 6.748, 6.743, 6.855, 6.742, 6.808, 6.588, 6.743, 6.752, 6.845, 6.638, 6.872, 6.713, 6.57, 6.735, 6.917), cat = c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 2L, 2L), diff = c(6.97305555555555, 3.39815972222222, 14.2874305555556, -0.759791666666667, 34.448275462963, 4.38783564814815, 0.142430555555556, 0.698599537037037, 1.22914351851852, 7.0008912037037, 1.3349537037037, 8.67978009259259, 1.6090162037037,    25.9466782407407, 9.45068287037037, 4.76284722222222, 1.79163194444444, 16.8280787037037, 1.01336805555556, 3.51240740740741)), .Names = c("lat", "lon", "cat", "diff"), row.names = c(125L, 705L, 435L, 682L, 186L, 783L, 250L, 517L, 547L, 369L, 618L, 280L, 839L, 614L, 371L, 786L, 542L, 100L, 667L, 785L), class = "data.frame")

Some sample data for df2:

df2 <- structure(list(latlonloc = structure(c(6L, 3L, 4L, 2L, 5L, 1L), .Label = c("Boelenslaan", "Borgercompagnie", "Froombosch", "Garrelsweer", "Stitswerd", "Tinallinge"), class = "factor"), lat = c(53.356789, 53.193886, 53.311237, 53.111339, 53.360848, 53.162031), lon = c(6.53493, 6.780792, 6.768608, 6.82354, 6.599604, 6.143804)), .Names = c("latlonloc", "lat", "lon"), class = "data.frame", row.names = c(NA, -6L))

Creating a distance matrix with the geosphere package:

library(geosphere)
mat <- distm(df1[,c('lon','lat')], df2[,c('lon','lat')], fun=distHaversine)

The resulting distances are in meters (at least I think they are, else something is wrong with the distance matrix).
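
One way to check the units without relying on the package is to compute a known distance by hand. The sketch below is a plain base R haversine, not geosphere's own code; `haversine_m` is a hypothetical helper name, and the radius of 6378137 m is an assumption chosen to match what I believe is the default `r` argument of `distHaversine`. If the result for one degree of latitude comes out near 111 km, the matrix above is in meters.

```r
# Hypothetical helper: great-circle distance in meters via the haversine formula.
# Assumes a spherical Earth of radius 6378137 m.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlon <- (lon2 - lon1) * to_rad
    a <- sin(dlat / 2)^2 +
        cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
    # pmin guards against rounding pushing sqrt(a) slightly above 1
    2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of latitude along a meridian should be roughly 111 km.
one_degree <- haversine_m(6.8, 53, 6.8, 54)
round(one_degree / 1000, 1)
```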

The specified distance is calculated with ((df1$cat)^2)*1000. I tried df1$names <- df2$latlonloc[apply(distmat, 1, which(distmat < ((df1$cat)^2)*1000 ))], but got an error message:

Error in match.fun(FUN) : 
  'which(distmat < ((df1$cat)^2) * 1000)' is not a function, character or symbol

This is probably not the correct approach, but what I need is this:

df1$names <- #code or function which gives me a string of names which are within a specified distance of the observation

How can I create a string with the names that are within a specified distance of the observations?

When you call apply, the third argument needs to be a function: either just the function name, or a complete function with an argument that will correspond to each element of the first argument to apply. You need something like apply(distmat, 1, function(x) ...) where ... is some function body that uses the x argument. I can't tell from your code what you want/need that to be, though. – Thomas
@Thomas I edited the question. Is it clear now? – Jaap
What do you expect which(distmat < ((df1$cat)^2) * 1000) to give you? – Thomas
@Thomas I expect that it gives me a set of names which are within a certain distance (((df1$cat)^2)*1000) of the observation. – Jaap
which is documented to give integers. Why would you think it gives names? – IRTFM

1 Answer

You need to operate on each row of df1 (or mat) in order to figure out, for each row, how far away each object in df2 is. From that, you can pick the ones that meet your distance criterion.

I think you're getting a little confused about the use of apply and about the use of which. For which to work for you, you need to apply it to each row of mat, whereas your current code applies it to the entire mat matrix. Also note that apply is awkward here because you're comparing each row of mat against the corresponding element of the vector defined by ((df1$cat)^2)*1000. So I will instead show you examples using sapply and lapply over the row indices. You could also use mapply here, but I think the sapply/lapply syntax is clearer.
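
To make the difference concrete, here is a minimal sketch on a made-up 2 x 3 matrix (`m` is a hypothetical stand-in, not your mat). Calling which on the whole matrix returns positions counted down the columns, while calling it on one row at a time returns column positions within that row, which is what you can use to index df2.

```r
# Hypothetical toy matrix: 2 observations (rows) x 3 places (columns).
m <- matrix(c(1, 5, 2,
              4, 1, 6), nrow = 2, byrow = TRUE)

# which() on the whole matrix: linear (column-major) positions,
# which say nothing directly about which row they belong to.
whole <- which(m < 3)

# which() on each row separately: column positions per observation.
per_row <- lapply(seq_len(nrow(m)), function(i) which(m[i, ] < 3))

whole
per_row
```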

To address your desired output, I show two examples. One returns a list containing, for each row in df1, the names of items in df2 that are within the distance threshold. This won't easily go back into your original df1 as a variable because each element in the list can contain multiple names. The second example pastes those names together as a single comma-separated character string in order to create the new variable you're looking for.

Example 1:

out1 <- lapply(seq_len(nrow(df1)), function(x) {
    # names in df2 that lie closer to observation x than its own threshold
    df2[which(mat[x,] < (((df1$cat)^2)*1000)[x]),'latlonloc']
})

Result:

> str(out1)
List of 20
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 2
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 4
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 6 4 5
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 
 $ : Factor w/ 6 levels "Boelenslaan",..: 4
 $ : Factor w/ 6 levels "Boelenslaan",..: 

Example 2:

out2 <- sapply(seq_len(nrow(df1)), function(x) {
    # same selection as Example 1, collapsed into one comma-separated string
    paste(df2[which(mat[x,] < (((df1$cat)^2)*1000)[x]),'latlonloc'], collapse=',')
})

Result:

> out2
 [1] ""                                 ""                                
 [3] ""                                 ""                                
 [5] ""                                 ""                                
 [7] ""                                 "Borgercompagnie"                 
 [9] ""                                 "Garrelsweer"                     
[11] ""                                 ""                                
[13] ""                                 ""                                
[15] "Tinallinge,Garrelsweer,Stitswerd" ""                                
[17] ""                                 ""                                
[19] "Garrelsweer"                      ""

I think the second of these is probably closest to what you're going for.
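
For completeness, the mapply variant I mentioned might look roughly like the sketch below, which also shows the result being stored as a new column. It runs on made-up toy data so it works without geosphere: `toy_mat`, `toy_names` and `toy_df` are hypothetical stand-ins for mat, df2$latlonloc and df1, and the distances are invented.

```r
# Toy stand-ins (hypothetical values, not the question's data):
# 3 observations x 3 named places, distances in meters.
toy_mat <- matrix(c( 500, 3000, 9000,
                    1200,  800, 2500,
                    7000, 8500,  900),
                  nrow = 3, byrow = TRUE)
toy_names <- c("A", "B", "C")
toy_df <- data.frame(cat = c(1L, 2L, 1L))

# One threshold per observation, using the same rule as in the question.
thresh <- toy_df$cat^2 * 1000

# Walk over row indices and thresholds in parallel and
# paste the matching names into one string per observation.
toy_df$names <- mapply(function(i, limit) {
    paste(toy_names[toy_mat[i, ] < limit], collapse = ",")
}, seq_len(nrow(toy_mat)), thresh)

toy_df$names
```

With your real data the same pattern would take mat, df2$latlonloc and df1$cat in place of the toy objects. Observations with no place inside their threshold get an empty string, as in Example 2.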