I am trying to match user check-ins with the closest city in an efficient manner.
I start with two RDDs containing the following fields:
- RDD1: checkin_id, user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory
- RDD2: City_name, lat, lon, country_code, country, city_type
I would like to join these two into the following format, based on the closest city as calculated by the haversine function:
- checkin_id, user_id, session_id, utc_time, timezone_offset, latitude, longitude, category, subcategory, City_name, country
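For reference, the haversine distance function I'm using looks roughly like this (a standard implementation; coordinates in degrees, result in kilometers):

```scala
import scala.math._

// Haversine great-circle distance in kilometers between two
// (latitude, longitude) points given in degrees.
def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val R = 6371.0 // mean Earth radius in km
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
    cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * R * asin(sqrt(a))
}
```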
In plain Scala I would do this with a nested for loop, but that approach does not translate to Spark. I have tried `rdd1.cartesian(rdd2)` followed by a reduce, but this materializes a massive N*M matrix.
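For concreteness, my cartesian attempt looks roughly like this (field and case-class names are illustrative; `haversine` is my distance function, and this needs a live `SparkContext`):

```scala
// Illustrative record types for the two RDDs.
case class Checkin(checkinId: String, userId: String, lat: Double, lon: Double)
case class City(name: String, lat: Double, lon: Double, country: String)

// Pair every check-in with every city (the N*M blow-up), then keep the
// city with the minimum haversine distance per check-in.
val joined = rdd1.cartesian(rdd2)
  .map { case (checkin, city) =>
    (checkin.checkinId,
     (checkin, city, haversine(checkin.lat, checkin.lon, city.lat, city.lon)))
  }
  .reduceByKey { (a, b) => if (a._3 <= b._3) a else b }
  .map { case (_, (checkin, city, _)) => (checkin, city.name, city.country) }
```

This produces the right answer, but the intermediate `cartesian` result dominates memory and shuffle cost.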
Is there a faster, more space-efficient way of joining these RDDs based on the shortest haversine distance?