2
votes

I was wondering what the most efficient method of calculating the distance in miles between two US zipcode columns would be using R.

I have heard of the geosphere package for computing the difference between zipcodes but do not fully understand it and was wondering if there were alternative methods as well.

For example, say I have a data frame that looks like this.

 ZIP_START     ZIP_END
 95051         98053
 94534         94128
 60193         60666
 94591         73344
 94128         94128
 94015         73344
 94553         94128
 10994         7105
 95008         94128

I want to create a new data frame that looks like this.

 ZIP_START     ZIP_END     MILES_DIFFERENCE
 95051         98053       x
 94534         94128       x
 60193         60666       x
 94591         73344       x
 94128         94128       x
 94015         73344       x
 94553         94128       x
 10994         7105        x
 95008         94128       x

Where x is the difference in miles between the two zipcodes.

What is the best method of calculating this distance?

Here is the R code to create the example data frame.

df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, 7105, 94128))

Please let me know if you have any questions.

Any advice is appreciated.

Thank you for your help.

3
Given "I have heard of the geosphere package for computing the difference between zipcodes", what examples have you seen that do this, what have you tried, and what isn't working? Questions on SO which appear to be simply asking for someone to do your work don't get a lot of attention (and get down-voted). SO is for asking for programming help on a program you have written. - SymbolixAU
There are several web services to do this, but their APIs are usually limited for free use and/or require registration. But given that there is a zipcode package (with latitude & longitude for every zip code), you should try to understand the distHaversine method in geosphere. It isn't very complicated - here's a code example. - neilfws

3 Answers

8
votes

There is a handy R package out there named "zipcode" which provides a table of zip code, city, state and the latitude and longitude. So once you have that information, the "geosphere" package can calculate the distance between points.

library(zipcode)
library(geosphere)

# Zip codes need to be character vectors, or else the leading zeros will be dropped, causing errors
df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), 
       "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128), 
       stringsAsFactors = FALSE)

data("zipcode")

df$distance_meters<-apply(df, 1, function(x){
  startindex<-which(x[["ZIP_START"]]==zipcode$zip)
  endindex<-which(x[["ZIP_END"]]==zipcode$zip)
  distGeo(p1=c(zipcode[startindex, "longitude"], zipcode[startindex, "latitude"]), p2=c(zipcode[endindex, "longitude"], zipcode[endindex, "latitude"]))
})

A warning about the column classes in your input data frame: zip codes should be character, not numeric; otherwise leading zeros are dropped, causing lookup errors.
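If the zip codes have already been read in as numbers, the leading zeros can be restored by padding to five digits. This is a minimal base R sketch; the vector name `zips_numeric` is made up for illustration.

```r
# Zip codes that were read in as numbers, so "07105" has become 7105
zips_numeric <- c(7105, 10994, 94128)

# Pad to five digits with leading zeros and store as character
zips_char <- sprintf("%05d", as.integer(zips_numeric))
zips_char
# [1] "07105" "10994" "94128"
```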

The distance returned by distGeo is in meters; I will leave the conversion to miles to the reader.
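For completeness, here is a small sketch of that conversion. The helper name `meters_to_miles` is hypothetical; the only fact used is that one international mile is exactly 1609.344 meters.

```r
# One international mile is exactly 1609.344 meters
METERS_PER_MILE <- 1609.344

# Hypothetical helper: convert a numeric vector of meters to miles
meters_to_miles <- function(meters) {
  meters / METERS_PER_MILE
}

# Applied to the column created above, this would be:
# df$distance_miles <- meters_to_miles(df$distance_meters)
example_miles <- meters_to_miles(c(0, 1609.344, 16093.44))
```

NA values pass through unchanged, so rows with an unmatched zip code stay NA.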

Update
The zipcode package appears to have been archived. There is a replacement package, "zipcodeR", which provides the longitude and latitude data along with additional information.

1
votes

As Dave2e mentioned, the original zipcode package has been removed from CRAN, so we need to use zipcodeR instead.

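# Install each package if it is missing, then load it
if (!require("zipcodeR")) { install.packages("zipcodeR"); library(zipcodeR) }
if (!require("geosphere")) { install.packages("geosphere"); library(geosphere) }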

df <- data.frame(
  "ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008),
  "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128),
  stringsAsFactors = FALSE
)

data("zip_code_db")

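df$distance_meters <- apply(df, 1, function(x) {
  startindex <- which(x[["ZIP_START"]] == zip_code_db$zipcode)
  endindex <- which(x[["ZIP_END"]] == zip_code_db$zipcode)
  # A zip code missing from zip_code_db has no match; return NA for that row
  if (length(startindex) == 0 || length(endindex) == 0) return(NA_real_)
  distGeo(p1 = c(zip_code_db[startindex, "lng"],
                 zip_code_db[startindex, "lat"]),
          p2 = c(zip_code_db[endindex, "lng"],
                 zip_code_db[endindex, "lat"]))
})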

Here's a fix based on the new zipcodeR package. Credit goes to Dave2e.
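As a possible variation, the per-row which() lookup can be replaced with match(), which looks up every zip code in one call and returns NA for codes that are absent. The sketch below is base R only and uses a small made-up table, `zip_lookup`, as a stand-in for zip_code_db (same column names, rounded coordinates), plus a hypothetical `trips` data frame.

```r
# Hypothetical stand-in for zip_code_db: three zip codes with rounded centroids
zip_lookup <- data.frame(
  zipcode = c("95051", "98053", "94128"),
  lng = c(-121.98, -122.02, -122.38),
  lat = c(37.35, 47.66, 37.62),
  stringsAsFactors = FALSE
)

trips <- data.frame(
  ZIP_START = c("95051", "94128", "95051"),
  ZIP_END = c("98053", "94128", "73344"),
  stringsAsFactors = FALSE
)

# match() gives the row position of each zip code, or NA when it is absent
start_idx <- match(trips$ZIP_START, zip_lookup$zipcode)
end_idx <- match(trips$ZIP_END, zip_lookup$zipcode)

# Indexing with NA yields NA, so unmatched zip codes get NA coordinates
trips$lng_start <- zip_lookup$lng[start_idx]
trips$lat_start <- zip_lookup$lat[start_idx]
trips$lng_end <- zip_lookup$lng[end_idx]
trips$lat_end <- zip_lookup$lat[end_idx]
```

The resulting coordinate columns can then be handed to a distance function in one call, longitude first, after deciding how to treat rows with NA coordinates.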

1
votes

The OP asks for "most efficient", so given

  • geosphere is quite slow when you want to use it on lots of data
  • apply is essentially a looping function and can often be beaten using vectorised code

I propose a fully vectorised solution using data.table and library(geodist)


# Zip codes need to be character vectors, or else the leading zeros will be dropped, causing errors
df <- data.frame("ZIP_START" = c(95051, 94534, 60193, 94591, 94128, 94015, 94553, 10994, 95008), 
                 "ZIP_END" = c(98053, 94128, 60666, 73344, 94128, 73344, 94128, "07105", 94128), 
                 stringsAsFactors = FALSE)


library(zipcodeR)
library(data.table)
library(geodist)

## Convert the zip codes to data.table so we can join on them
## I'm using the centroid of the zipcodes (lng and lat).
## If you want the distance to the edge of the zipcode boundary you'll
## need to convert this into a spatial data set
dt_zips <- as.data.table( zip_code_db[, c("zipcode", "lng", "lat")])

## convert the input data.frame into a data.table
setDT( df )

## the postcodes need to be characters
df[
  , `:=`(
    ZIP_START = as.character( ZIP_START )
    , ZIP_END = as.character( ZIP_END )
  )
]

## Attach origin lon & lat using a join
df[
  dt_zips
  , on = .(ZIP_START = zipcode)
  , `:=`(
    lng_start = lng
    , lat_start = lat
  )
]

## Attach destination lon & lat using a join
df[
  dt_zips
  , on = .(ZIP_END = zipcode)
  , `:=`(
    lng_end = lng
    , lat_end = lat
  )
]

## calculate the distance
df[
  , distance_metres := geodist::geodist_vec(
    x1 = lng_start
    , y1 = lat_start
    , x2 = lng_end
    , y2 = lat_end
    , paired = TRUE
    , measure = "haversine"
  )
]

## et voila - note the missing zipcodes 60666 and 73344
df

#    ZIP_START ZIP_END lng_start lat_start lng_end lat_end distance_metres
# 1:     95051   98053   -121.98     37.35 -122.02   47.66      1147708.60
# 2:     94534   94128   -122.10     38.20 -122.38   37.62        69090.01
# 3:     60193   60666    -88.09     42.01      NA      NA              NA
# 4:     94591   73344   -122.20     38.12      NA      NA              NA
# 5:     94128   94128   -122.38     37.62 -122.38   37.62            0.00
# 6:     94015   73344   -122.48     37.68      NA      NA              NA
# 7:     94553   94128   -122.10     38.00 -122.38   37.62        48947.02
# 8:     10994   07105    -73.97     41.10  -74.15   40.72        44930.17
# 9:     95008   94128   -121.94     37.28 -122.38   37.62        54263.61

Also note the returned distance is given in metres.
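For readers who want to see what a haversine measure computes, here is a vectorised base R sketch of the great-circle formula. It is not the geodist implementation; the function name `haversine_m` is made up, and the radius is an assumed value of 6378137 m (the WGS84 equatorial radius), so results may differ slightly from a library's output.

```r
# Hypothetical helper: haversine great-circle distance in metres.
# Inputs are vectors of longitudes/latitudes in decimal degrees.
# r is an assumed Earth radius (WGS84 equatorial radius, in metres).
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# Rounded centroids for the first and fifth rows of the table above
d <- haversine_m(
  lng1 = c(-121.98, -122.38), lat1 = c(37.35, 37.62),
  lng2 = c(-122.02, -122.38), lat2 = c(47.66, 37.62)
)
```

Because every operation is vectorised, the function can be applied to whole columns at once, and NA coordinates give NA distances.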