3 votes

I am trying to match user logins with the closest city in an efficient manner.

Starting with two RDDs with the following:

  • RDD1: checkin_id ,user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory
  • RDD2: City_name, lat, lon, country_code, country, city_type

I would like to join these two into the following format, matching each check-in to the closest city as calculated by the haversine function.

  • checkin_id ,user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory, City_name, country

In plain Scala I do this with a double for loop, but that is not an option in Spark. I have tried using cartesian (rdd1.cartesian(rdd2)) and then reducing, but this gives me a massive N*M matrix.
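For concreteness, the plain-Scala double loop looks roughly like this (case classes trimmed to the relevant fields; all names are illustrative):

```scala
// Nested-loop nearest-city matching: O(N * M) distance computations.
case class Checkin(checkinId: String, lat: Double, lon: Double)
case class City(name: String, lat: Double, lon: Double, country: String)

// Haversine great-circle distance in kilometres.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// Outer loop over check-ins, inner scan over all cities.
def closestCities(checkins: Seq[Checkin], cities: Seq[City]): Seq[(Checkin, City)] =
  for (u <- checkins) yield
    (u, cities.minBy(c => haversineKm(u.lat, u.lon, c.lat, c.lon)))
```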

Is there a faster, more space-efficient way of joining these RDDs based on the shortest haversine distance?

Don't think there's any way you can get around doing a cartesian product -- for every login you have to calculate the distance to every city, which means sort of by definition you need to do a cartesian product. – David Griffin

1 Answer

1 vote

One way to approach this is to avoid the join completely. Assuming that #cities << #checkins (in other words, RDD2.count << RDD1.count) the most efficient approach is to simply map over users:

  • convert RDD2 to a local data structure
  • convert it to a format which can be used for efficient geo-spatial queries (for example a k-d tree)
  • broadcast it and use it for mapping
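Assuming the whole city list fits in memory, the steps above might be sketched as follows. A linear `minBy` scan stands in for the k-d tree (fine for a few thousand cities), and the Spark-specific calls are shown as comments so the snippet stays self-contained:

```scala
// Broadcast-and-map sketch: cities live in a plain array on every worker.
// Replace the linear scan with a real k-d tree if #cities is large.
case class City(name: String, lat: Double, lon: Double, country: String)

def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val r = 6371.0 // mean Earth radius, km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

def nearestCity(cities: Array[City])(lat: Double, lon: Double): City =
  cities.minBy(c => haversineKm(lat, lon, c.lat, c.lon))

// With Spark (not executed here):
//   val citiesB = sc.broadcast(rdd2.collect())   // convert to local + broadcast
//   val joined = rdd1.map { u =>
//     val c = nearestCity(citiesB.value)(u.latitude, u.longitude)
//     (u.checkin_id, u.user_id, /* remaining fields */ c.name, c.country)
//   }
```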

If RDD2 is too large to be stored in memory but small enough to be passed around as a single file, you can easily adjust this approach by replacing the local data structure with a solution like SpatiaLite:

  • write the data to a database
  • distribute it to the workers using standard Spark tools (SparkFiles)
  • map over users, querying the local database for the nearest city
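A rough sketch of the per-worker lookup, assuming the file was written as a table cities with a geometry column geom (MakePoint and Distance are SpatiaLite SQL functions; plain Distance is planar and only an ordering proxy here, and the JDBC part is left as comments since it needs the driver and the SpatiaLite extension on the workers):

```scala
// Build the nearest-city query for a SpatiaLite database shipped via
// SparkFiles. Table name `cities` and column `geom` are assumptions.
def nearestCitySql(lat: Double, lon: Double): String =
  s"""SELECT City_name, country
     |FROM cities
     |ORDER BY Distance(geom, MakePoint($lon, $lat, 4326))
     |LIMIT 1""".stripMargin

// With Spark (not executed here):
//   rdd1.mapPartitions { checkins =>
//     val path = org.apache.spark.SparkFiles.get("cities.sqlite")
//     val conn = java.sql.DriverManager.getConnection(s"jdbc:sqlite:$path")
//     // after loading the SpatiaLite extension, run
//     // nearestCitySql(c.latitude, c.longitude) for each check-in c
//     checkins
//   }
```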

Finally, if none of the above works for you, be smart about the way you join:

  • you can easily use latitude and longitude to map each user position to some coarser entity like a continent, country, or local administrative unit, and use that to perform an initial join (obviously, if a user is somewhere in Europe, checking against Melbourne, Australia is pointless)
  • use tools like GeoHash to assign users and cities to buckets which can then be used as join keys (this will require some adjustment for border cases - you may have to put a single object into multiple buckets if it sits near a bucket boundary, e.g. at the equator or the 180th meridian).
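A minimal geohash encoder is enough to generate such bucket keys (precision choice and the plain join are illustrative; emitting border objects under neighbouring prefixes is left as a comment):

```scala
// Minimal geohash encoder; a prefix of length `precision` is the bucket key.
// Standard algorithm: interleave longitude/latitude bisection bits, 5 bits
// per base-32 character.
def geohash(lat: Double, lon: Double, precision: Int): String = {
  val base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
  var (latLo, latHi) = (-90.0, 90.0)
  var (lonLo, lonHi) = (-180.0, 180.0)
  val sb = new StringBuilder
  var isLon = true // geohash starts with a longitude bit
  var bits = 0
  var ch = 0
  while (sb.length < precision) {
    if (isLon) {
      val mid = (lonLo + lonHi) / 2
      if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid } else { ch = ch << 1; lonHi = mid }
    } else {
      val mid = (latLo + latHi) / 2
      if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid } else { ch = ch << 1; latHi = mid }
    }
    isLon = !isLon
    bits += 1
    if (bits == 5) { sb.append(base32(ch)); bits = 0; ch = 0 }
  }
  sb.toString
}

// With Spark (not executed here), the bucketed join would look like:
//   val userKeyed = rdd1.map(u => (geohash(u.latitude, u.longitude, 4), u))
//   val cityKeyed = rdd2.map(c => (geohash(c.lat, c.lon, 4), c))
//   userKeyed.join(cityKeyed)  // then keep the closest city per check-in;
//   // border objects should also be emitted under neighbouring prefixes
```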