3 votes

I am trying to match user logins with the closest city in an efficient manner.

Starting with two RDDs with the following:

  • RDD1: checkin_id ,user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory
  • RDD2: City_name, lat, lon, country_code, country, city_type

I would like to join these two into the following format, matching each check-in to the closest city as calculated by the haversine function.

  • checkin_id ,user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory, City_name, country

In plain Scala I do this with a double for loop, but that is not an option in Spark. I have tried using cartesian (rdd1.cartesian(rdd2)) and then reducing, but this gives me a massive N*M matrix.
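For concreteness, the plain-Scala double loop looks roughly like this (case classes trimmed to the relevant fields; all names are illustrative):

```scala
// Nested-loop nearest-city matching: O(N * M) distance computations.
case class Checkin(checkinId: String, lat: Double, lon: Double)
case class City(name: String, lat: Double, lon: Double, country: String)

// Haversine great-circle distance in kilometres.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// Outer loop over check-ins, inner scan over all cities.
def closestCities(checkins: Seq[Checkin], cities: Seq[City]): Seq[(Checkin, City)] =
  for (u <- checkins) yield
    (u, cities.minBy(c => haversineKm(u.lat, u.lon, c.lat, c.lon)))
```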

Is there a faster, more space-efficient way of joining these RDDs based on the shortest haversine distance?

Don't think there's any way you can get around doing a cartesian product -- for every login you have to calculate the distance to every city, which means sort of by definition you need to do a cartesian product. – David Griffin

1 Answer

1 vote

One way to approach this is to avoid the join completely. Assuming that #cities << #checkins (in other words, RDD2.count << RDD1.count) the most efficient approach is to simply map over users:

  • convert RDD2 to a local data structure
  • convert it to a format which can be used for efficient geo-spatial queries (for example a k-d tree)
  • broadcast it and use it for mapping
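Assuming the whole city list fits in memory, the steps above might be sketched as follows. A linear `minBy` scan stands in for the k-d tree (fine for a few thousand cities), and the Spark-specific calls are shown as comments so the snippet stays self-contained:

```scala
// Broadcast-and-map sketch: cities live in a plain array on every worker.
// Replace the linear scan with a real k-d tree if #cities is large.
case class City(name: String, lat: Double, lon: Double, country: String)

def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

def nearestCity(cities: Array[City])(lat: Double, lon: Double): City =
  cities.minBy(c => haversineKm(lat, lon, c.lat, c.lon))

// With Spark (not executed here):
//   val citiesB = sc.broadcast(rdd2.collect())   // convert to local + broadcast
//   val joined = rdd1.map { u =>
//     val c = nearestCity(citiesB.value)(u.latitude, u.longitude)
//     (u.checkin_id, u.user_id, /* remaining fields */ c.name, c.country)
//   }
```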

If RDD2 is too large to be stored in memory but small enough to be passed around as a single file, you can easily adjust this approach by replacing the local data structure with a solution like SpatiaLite:

  • write the data to a database
  • distribute it to the workers using standard Spark tools (SparkFiles)
  • map over users, querying the local database for the nearest city
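A rough sketch of the per-worker lookup, assuming the file was written as a table cities with a geometry column geom (MakePoint and Distance are SpatiaLite SQL functions; plain Distance is planar and only an ordering proxy here, and the JDBC part is left as comments since it needs the driver and the SpatiaLite extension on the workers):

```scala
// Build the nearest-city query for a SpatiaLite database shipped via
// SparkFiles. Table name `cities` and column `geom` are assumptions.
def nearestCitySql(lat: Double, lon: Double): String =
  s"""SELECT City_name, country
     |FROM cities
     |ORDER BY Distance(geom, MakePoint($lon, $lat, 4326))
     |LIMIT 1""".stripMargin

// With Spark (not executed here):
//   rdd1.mapPartitions { checkins =>
//     val path = org.apache.spark.SparkFiles.get("cities.sqlite")
//     val conn = java.sql.DriverManager.getConnection(s"jdbc:sqlite:$path")
//     // after loading the SpatiaLite extension, run
//     // nearestCitySql(c.latitude, c.longitude) for each check-in c
//     checkins
//   }
```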

Finally, if none of the above works for you, be smart about the way you join:

  • you can easily use latitude and longitude to map each user position to some coarser entity like a continent, country, or local administrative unit, and use that to perform an initial join (obviously, if a user is somewhere in Europe, checking against Melbourne, Australia is pointless)
  • use tools like GeoHash to assign users and cities to buckets which can then be used as join keys (this will require some adjustment for border cases - you may have to put a single object into multiple buckets if it sits near a bucket boundary, e.g. at the equator or the 180th meridian).
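A minimal geohash encoder is enough to generate such bucket keys (precision choice and the plain join are illustrative; emitting border objects under neighbouring prefixes is left as a comment):

```scala
// Minimal geohash encoder; a prefix of length `precision` is the bucket key.
// Standard algorithm: interleave longitude/latitude bisection bits, 5 bits
// per base-32 character.
def geohash(lat: Double, lon: Double, precision: Int): String = {
  val base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
  var (latLo, latHi) = (-90.0, 90.0)
  var (lonLo, lonHi) = (-180.0, 180.0)
  val sb = new StringBuilder
  var isLon = true // geohash starts with a longitude bit
  var bits = 0
  var ch = 0
  while (sb.length < precision) {
    if (isLon) {
      val mid = (lonLo + lonHi) / 2
      if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid } else { ch = ch << 1; lonHi = mid }
    } else {
      val mid = (latLo + latHi) / 2
      if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid } else { ch = ch << 1; latHi = mid }
    }
    isLon = !isLon
    bits += 1
    if (bits == 5) { sb.append(base32(ch)); bits = 0; ch = 0 }
  }
  sb.toString
}

// With Spark (not executed here), the bucketed join would look like:
//   val userKeyed = rdd1.map(u => (geohash(u.latitude, u.longitude, 4), u))
//   val cityKeyed = rdd2.map(c => (geohash(c.lat, c.lon, 4), c))
//   userKeyed.join(cityKeyed)  // then keep the closest city per check-in;
//   // border objects should also be emitted under neighbouring prefixes
```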