I have a dataset with about 2.4M rows, with a unique key for each row. I have performed some complex SQL queries on some other tables, producing a dataset with two columns: a key and the value true. This dataset is about 500 rows. Now I would like to (outer) join this dataset with my original table.
This produces a new table with a very sparse set of values (true in about 500 rows, null elsewhere).
Finally, I would like to do this about 200 times, giving me a final table of about 201 columns (the key, plus the 200 sparse columns).
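Roughly, the pattern looks like this (a simplified Scala sketch, not my actual code; `base`, `flagSets`, and the column names `key`/`value`/`flag_i` are placeholders):

```scala
import org.apache.spark.sql.DataFrame

// base:     the 2.4M-row table, keyed by "key"
// flagSets: ~200 small DataFrames (~500 rows each) with columns (key, value = true)
def buildWideTable(base: DataFrame, flagSets: Seq[DataFrame]): DataFrame =
  flagSets.zipWithIndex.foldLeft(base) { case (acc, (flags, i)) =>
    val renamed = flags.withColumnRenamed("value", s"flag_$i")
    // outer join on the key: the new column is true for ~500 keys, null elsewhere
    acc.join(renamed, Seq("key"), "outer")
  }
```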
When I run this, I notice that each join gets considerably slower. The first join takes 2 seconds, then 4s, then 6s, then 10s, then 20s, and after about 30 joins the system never recovers. Of course, the exact numbers are irrelevant since they depend on the cluster I'm running on, but I'm wondering:
- Is this slowdown expected?
- I am using Parquet as the data storage format (columnar storage), so I was hopeful that adding more columns would scale horizontally. Is that a correct assumption?
- None of the columns I've joined so far are needed for the Nth join; can they be unloaded from memory?
- Are there other things I can do when combining lots of columns in Spark?
- Calling `explain` on each join in the loop shows that each join is getting more complex: the plan appears to include all the previous joins, and it also includes the complex SQL queries, even though those have been checkpointed. Is there a way to really checkpoint so that each join is just a join? I am calling `show()` after each join, so I assumed the join actually happens at that point. A simplified sketch of one iteration is below.
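For reference, each iteration of my loop looks roughly like this (simplified; `addFlagColumn` and the column names are illustrative, and the checkpoint directory has been set beforehand via `sparkContext.setCheckpointDir`):

```scala
import org.apache.spark.sql.DataFrame

// One iteration: join the small flag DataFrame, checkpoint, and materialize.
// I expected checkpoint() (eager by default) plus show() to cut the lineage, but
// explain() on the result still shows all the previous joins and the original SQL queries.
def addFlagColumn(acc: DataFrame, flags: DataFrame, name: String): DataFrame = {
  val joined = acc.join(flags.withColumnRenamed("value", name), Seq("key"), "outer")
  val checkpointed = joined.checkpoint() // requires sparkContext.setCheckpointDir to be set
  checkpointed.show()                    // forces the join to run at this point
  checkpointed.explain()                 // yet the plan keeps growing despite the checkpoint
  checkpointed
}
```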