I am new to Spark and we currently have a pretty small cluster (14 nodes, 48 cores per node). I have 2 DataFrames. One is 41 million records (customers) and the other is 100k (locations). Each table has latitude and longitude, and the locations table has a handful of additional attributes. I want to calculate the distance between each customer and each location and then, for each customer, summarize the attributes of the locations that fall within 15 miles.
I can of course create a join between the tables where I calculate the distance and then filter (or include the distance criterion in the 'on' clause), roughly as sketched below. But this cartesian product (41 million × 100k, on the order of 4 trillion rows) is very large and the job never finishes.
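For reference, this is roughly what I'm doing today (a minimal PySpark sketch; the table names, column names, and the `some_attribute` aggregation are placeholders for my real schema):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# ~41M rows: customer_id, lat, lon, ...
customers = (spark.table("customers")
             .select("customer_id",
                     F.col("lat").alias("c_lat"),
                     F.col("lon").alias("c_lon")))

# ~100k rows: location_id, lat, lon, plus a handful of attributes
locations = (spark.table("locations")
             .select("location_id",
                     F.col("lat").alias("l_lat"),
                     F.col("lon").alias("l_lon"),
                     "some_attribute"))

# Haversine great-circle distance in miles (Earth radius ~3959 mi)
def haversine_miles(lat1, lon1, lat2, lon2):
    return 3959.0 * 2 * F.asin(F.sqrt(
        F.pow(F.sin(F.radians(lat2 - lat1) / 2), 2) +
        F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) *
        F.pow(F.sin(F.radians(lon2 - lon1) / 2), 2)))

# Cross join every customer with every location, then filter to 15 miles.
# (Putting the distance predicate in the join condition behaves the same way:
# Spark still has to evaluate the full cartesian product.)
joined = (customers.crossJoin(locations)
          .withColumn("dist_miles",
                      haversine_miles(F.col("c_lat"), F.col("c_lon"),
                                      F.col("l_lat"), F.col("l_lon")))
          .filter(F.col("dist_miles") <= 15))

# Summarize the location attributes per customer
summary = joined.groupBy("customer_id").agg(F.sum("some_attribute"))
```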
Are there any general Spark settings to consider here? Are any methods better than others (RDDs with cartesian versus a DataFrame join)? I realize this is a rather general question, but I am looking for any best practices, settings to consider, number of partitions, things to try, etc.