
I have a 100 x 100 correlation matrix with zip codes as the column and row names. I also have a data frame that contains the latitude and longitude for all zip codes, and a function that calculates the distance based on lat and long.

Here is a snippet of the correlation matrix

            08846       48186       90621      92602       92701       92702       92703      92705      92706      92712
08846  1.00000000 -0.18704668  0.17631080 -0.0195590 -0.08640209 -0.09109788 -0.04251868 -0.1586506 -0.0778115 -0.0572327
48186 -0.18704668  1.00000000 -0.09365048  0.1616530  0.20468051  0.17682056  0.18009911  0.1417840  0.1958971  0.1938676
90621  0.17631080 -0.09365048  1.00000000  0.5880756  0.75200501  0.74694849  0.76071605  0.6593806  0.7640519  0.7657806
92602 -0.01955900  0.16165299  0.58807565  1.0000000  0.88187818  0.88947447  0.89310793  0.9615530  0.8926566  0.8926482
92701 -0.08640209  0.20468051  0.75200501  0.8818782  1.00000000  0.99314798  0.98011569  0.9294281  0.9827633  0.9886139
92702 -0.09109788  0.17682056  0.74694849  0.8894745  0.99314798  1.00000000  0.98791442  0.9470895  0.9853157  0.9933086
92703 -0.04251868  0.18009911  0.76071605  0.8931079  0.98011569  0.98791442  1.00000000  0.9321385  0.9938496  0.9981231
92705 -0.15865058  0.14178399  0.65938061  0.9615530  0.92942815  0.94708954  0.93213849  1.0000000  0.9268797  0.9357917
92706 -0.07781150  0.19589706  0.76405191  0.8926566  0.98276329  0.98531570  0.99384961  0.9268797  1.0000000  0.9948550
92712 -0.05723270  0.19386757  0.76578065  0.8926482  0.98861389  0.99330864  0.99812312  0.9357917  0.9948550  1.0000000

Here is a snippet of the table of zip codes

    zip       city state latitude longitude
1 00210 Portsmouth    NH  43.0059  -71.0132
2 00211 Portsmouth    NH  43.0059  -71.0132
3 00212 Portsmouth    NH  43.0059  -71.0132
4 00213 Portsmouth    NH  43.0059  -71.0132
5 00214 Portsmouth    NH  43.0059  -71.0132
6 00215 Portsmouth    NH  43.0059  -71.0132

And here is the function that calculates the distance between two lat/long points.

Calc_Dist <- function (long1, lat1, long2, lat2)
{
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- long1 * rad
  b1 <- lat2 * rad
  b2 <- long2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145
  d <- R * c
  return(d)
}

My goal here is to subset the correlation matrix to only include zip codes that are more than 500 miles apart (right now the distance calculation outputs in kilometers but that can be easily changed and is immaterial right now). The less expensive the better as I may have to do this with larger correlation matrices (~10000 x 10000). Any suggestions?

Thanks in advance, Ben

Any tips? I'm really struggling with a good way to connect the three data sources - ben890
Any feedback? Not sure what you ended up with. - Mike.Gahan
My apologies I meant to accept your answer. Thanks for the help! - ben890

1 Answer


Is it critical that you use that distance function? I think `dist` should be much more efficient.

# Making your zip.table a data.table helps us with speed
library(data.table)
setDT(zip.table)

# Keep only the zip codes that appear in the correlation matrix, sorted by zip
zip.corr <- as.matrix(zip.corr)
zip.table <- zip.table[zip %in% rownames(zip.corr)]
setorder(zip.table, zip)

# Calculate distance matrix (in degrees) and put the upper triangle into table form
zip.dist <- as.matrix(dist(zip.table[, .(longitude = abs(longitude), latitude)]))
dimnames(zip.dist) <- list(as.character(zip.table$zip), as.character(zip.table$zip))
idx <- which(upper.tri(zip.dist), arr.ind = TRUE)
zip.dist <- data.table(zip1 = rownames(zip.dist)[idx[, 1]],
                       zip2 = colnames(zip.dist)[idx[, 2]],
                       distance = zip.dist[idx])

# Do a very similar procedure with your correlation matrix
# This assumes the rows/columns of zip.corr are sorted by zip, like zip.table
idx <- which(upper.tri(zip.corr), arr.ind = TRUE)
zip.corr <- data.table(zip1 = rownames(zip.corr)[idx[, 1]],
                       zip2 = colnames(zip.corr)[idx[, 2]],
                       cor = zip.corr[idx])

# Subset zip.dist to only include zip codes more than 500 miles apart
zip.dist <- zip.dist[distance * 69 > 500] # 69 miles ~ 1 degree lat/lon

# Merge together (inner join on the zip pair)
setkey(zip.dist, zip1, zip2)
setkey(zip.corr, zip1, zip2)
result.table <- zip.dist[zip.corr, nomatch = 0]
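
The 69-miles-per-degree conversion used above can be checked against the haversine function from the question. This is a minimal sketch, assuming a reference point near Santa Ana, CA (the coordinates are approximate and chosen only for illustration) and the standard 1.609344 km per mile:

```r
# Haversine distance from the question (returns kilometers)
Calc_Dist <- function(long1, lat1, long2, lat2) {
  rad <- pi / 180
  a1 <- lat1 * rad; a2 <- long1 * rad
  b1 <- lat2 * rad; b2 <- long2 * rad
  a <- sin((b1 - a1) / 2)^2 + cos(a1) * cos(b1) * sin((b2 - a2) / 2)^2
  6378.145 * 2 * atan2(sqrt(a), sqrt(1 - a))
}

km.per.mile <- 1.609344

# Assumed reference point (approximate, for illustration only)
lat0 <- 33.75
lon0 <- -117.87

# Length in miles of one degree of latitude and one degree of longitude there
one.deg.lat <- Calc_Dist(lon0, lat0, lon0, lat0 + 1) / km.per.mile
one.deg.lon <- Calc_Dist(lon0, lat0, lon0 + 1, lat0) / km.per.mile

print(c(latitude = one.deg.lat, longitude = one.deg.lon))
```

A degree of latitude comes out at roughly 69 miles, while a degree of longitude at this latitude is shorter because it scales with the cosine of the latitude. The flat conversion therefore overstates east-west distances somewhat.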

Since these places are all pretty close to one another, I don't think you lose much by using Euclidean distance, especially since each zip code is represented by a single lat/lon point inside what can be a large county.
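
If the great-circle distance does matter, the `Calc_Dist` function from the question is already vectorized, so one alternative is to build the full distance matrix with `outer` and mask the correlation matrix directly. This is a sketch rather than a tested solution for your data: the four sets of coordinates below are approximate values assumed for illustration, and the names `zips`, `corr`, `dist.miles` and `far.corr` are made up for this example. The correlations are copied from the snippet in the question.

```r
# Haversine distance from the question (returns kilometers)
Calc_Dist <- function(long1, lat1, long2, lat2) {
  rad <- pi / 180
  a1 <- lat1 * rad; a2 <- long1 * rad
  b1 <- lat2 * rad; b2 <- long2 * rad
  a <- sin((b1 - a1) / 2)^2 + cos(a1) * cos(b1) * sin((b2 - a2) / 2)^2
  6378.145 * 2 * atan2(sqrt(a), sqrt(1 - a))
}

# Approximate coordinates, assumed here for illustration only
zips <- data.frame(
  zip       = c("08846", "48186", "90621", "92701"),
  latitude  = c(40.57, 42.29, 33.87, 33.75),
  longitude = c(-74.50, -83.37, -118.00, -117.87),
  stringsAsFactors = FALSE
)

# Matching corner of the correlation matrix from the question
corr <- matrix(
  c( 1.00000000, -0.18704668,  0.17631080, -0.08640209,
    -0.18704668,  1.00000000, -0.09365048,  0.20468051,
     0.17631080, -0.09365048,  1.00000000,  0.75200501,
    -0.08640209,  0.20468051,  0.75200501,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(zips$zip, zips$zip)
)

# Pairwise great-circle distances in miles, in the same order as corr
n <- nrow(zips)
dist.miles <- outer(seq_len(n), seq_len(n), function(i, j) {
  Calc_Dist(zips$longitude[i], zips$latitude[i],
            zips$longitude[j], zips$latitude[j])
}) / 1.609344
dimnames(dist.miles) <- list(zips$zip, zips$zip)

# Blank out every pair that is 500 miles apart or closer (including the diagonal)
far.corr <- corr
far.corr[dist.miles <= 500] <- NA
print(far.corr)
```

This keeps the matrix shape instead of producing a long table, which may or may not be what you want downstream. For ~10000 zip codes each 10000 x 10000 numeric matrix takes about 800 MB, so memory rather than speed is likely to be the limit.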