
I am new to Spark and we currently have a pretty small cluster (14 nodes, 48 cores per node). I have 2 Dataframes. One is 41 million records (customers) and the other is 100k (locations). Each table has latitude and longitude and the location table has a handful of attributes. I want to calculate the distance between each customer and each location and then summarize the additional attributes of the location for each customer for those that are within 15 miles.
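For reference, the per-pair distance computation itself is cheap; the expensive part is the number of pairs. A minimal haversine sketch in plain Python (the function and constant names are mine, not from the question):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Example: New York to Philadelphia, roughly 80 miles.
d = haversine_miles(40.7128, -74.0060, 39.9526, -75.1652)
```

The same formula can be expressed as a Spark SQL expression or UDF inside the join condition; the problem described below is the 41M x 100k pair count, not the formula.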

I can of course create a join between the tables where I calculate the distance and then filter (or include the distance criteria in the 'on' clause). But this cartesian product is very large and never finishes.

Are there any general settings for Spark to consider here? Are any methods better than others (RDDs with cartesian versus a DF join)? I realize this is a rather general question, but I am looking for any best practices, settings to consider, number of partitions, things to try, etc.


1 Answer


General answer to general question:

  • A Cartesian product is a brute-force solution - it barely works on small data, and it definitely doesn't scale.
  • Your location data is orders of magnitude smaller (I assume it doesn't contain more than 1 KB of data per record or so). Use that to your advantage. Either use a broadcast join (if the data is smallish, up to a few GB) or distribute it as a file to each node and read it from there (up to 100 GB or so).
  • There are well-established structures and tools for querying geospatial data. Use these to avoid a brute-force search. At minimum you can use a local k-d tree to quickly search for nearest neighbors.
  • Even if your data grows, you can still leverage its basic properties. For example:

    • Define 15 miles x 15 miles grid.
    • Assign each customer to a square.
    • Assign each location to its actual square, as well as to each of the 8 adjacent squares (a customer within 15 miles of a location must be in the same square or an adjacent one, though not every customer in those squares is within 15 miles). This multiplies the location data 9-fold.
    • Join on grid membership - the data is larger, but it can be done with a hash join and doesn't require a Cartesian product. Then apply the exact distance filter and drop duplicates.
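To illustrate the k-d tree point above, here is a local sketch using SciPy's `cKDTree`. Everything here is illustrative: the variable names are mine, and the crude degrees-to-miles projection is only reasonable over small areas (use a proper map projection in production). On a cluster, the tree built from the small location table would be broadcast to executors and queried per customer partition.

```python
import numpy as np
from scipy.spatial import cKDTree

def to_miles_xy(lat, lon, lat0):
    """Approximate planar projection: degrees -> miles near latitude lat0."""
    return np.column_stack([
        np.asarray(lon) * 69.0 * np.cos(np.radians(lat0)),
        np.asarray(lat) * 69.0,
    ])

# Tiny toy data: three locations, one customer.
loc_lat = np.array([40.0, 40.2, 41.0])
loc_lon = np.array([-75.0, -75.1, -76.0])
cust_lat = np.array([40.05])
cust_lon = np.array([-75.02])

lat0 = loc_lat.mean()
tree = cKDTree(to_miles_xy(loc_lat, loc_lon, lat0))

# For each customer, indices of all locations within 15 miles.
hits = tree.query_ball_point(to_miles_xy(cust_lat, cust_lon, lat0), r=15.0)
```

Here the first two locations (a few miles and ~11 miles away) match, while the third (~70 miles away) does not, without ever comparing every customer to every location.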
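The grid steps above can be sketched in plain Python (all names invented; a dict stands in for the hash join, and the same cell keys would work as join columns in Spark):

```python
from collections import defaultdict
from math import cos, floor, radians

CELL_MILES = 15.0
MILES_PER_DEG_LAT = 69.0

def cell(lat, lon, lat0):
    """Map a point to a 15x15-mile grid cell (approximate planar grid)."""
    x = lon * MILES_PER_DEG_LAT * cos(radians(lat0))
    y = lat * MILES_PER_DEG_LAT
    return (floor(x / CELL_MILES), floor(y / CELL_MILES))

def location_cells(lat, lon, lat0):
    """A location's own cell plus its 8 neighbours (9 rows per location)."""
    cx, cy = cell(lat, lon, lat0)
    return [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

# Toy data: one location, one nearby customer, one customer ~100 miles away.
locations = [("loc_a", 40.00, -75.00)]
customers = [("cust_1", 40.05, -75.02), ("cust_2", 41.50, -75.00)]
lat0 = 40.0

# Build the 9-fold replicated location index, keyed by cell.
index = defaultdict(list)
for loc_id, lat, lon in locations:
    for c in location_cells(lat, lon, lat0):
        index[c].append(loc_id)

# "Join" each customer to candidate locations via its single cell.
candidates = {cid: index[cell(lat, lon, lat0)]
              for cid, lat, lon in customers}
```

The nearby customer finds `loc_a` as a candidate; the distant one finds nothing, so it never reaches the exact distance filter. Candidate pairs still need that exact filter, since adjacent-cell points can be farther than 15 miles apart.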