2 votes

I have seen a general recommendation that anyone using Spark (in my case with Scala) should avoid actions that bring all the data from the executors to the driver (collect, count, sum, etc.). However, when I tried to use the Spark statistics library http://spark.apache.org/docs/2.2.0/ml-statistics.html, I got the impression that the correlation matrix and ChiSquareTest methods expect an array, or a matrix built from an array/seq, as their parameters. So I don't see how I can avoid calling collect on the DataFrame (plus, I assume, some more manipulation to turn the rows into Vectors) if I want to use these functions. I would appreciate any help.

It says avoid it when you can, not bypass it completely. Collect operations are needed to get any relevant output. – waterbyte
I've read that it is not recommended in production because the driver might collapse with big data frames. However, if I instead do many manipulations like groupBy and joins on the DataFrame, isn't that also expensive? I'm trying to understand what the better approach is in general. – Ron_ad

1 Answer

2 votes
  1. In your example, both Correlation.corr and ChiSquareTest.test accept a DataFrame, so you don't need to collect data before passing it to these functions. You will have to collect the results of these functions on the driver, but that shouldn't cause any issues: the output is much smaller than the initial dataset and easily fits into the driver's memory (see the first sketch after this list).
  2. To your question in the comment about groupBy / joins: those are "expensive", but for a different reason. Grouping and joins lead to data shuffling, so your workers need to send a lot of data across the network, which takes much more time than local data processing. Still, if you have to do it, it's OK to do it; just be aware of the performance implications (the second sketch below shows where the shuffle appears in a query plan).
  3. The collect method is not recommended on a full dataset, as it may lead to an OOM error on the driver (imagine a 50 GB dataset, distributed over a cluster, now being collected on a single node). But if you've already processed your data and you know there will be a reasonable number of rows, it's pretty safe. count is not a problem from a memory standpoint at all, as it only returns the number of rows in your dataset instead of sending all of them to the driver node (the third sketch below illustrates both).
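
On point 1, here is a minimal Scala sketch, closely following the example on the linked ml-statistics page; the "features" column name, the app name, and the sample vectors are just illustrative:

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("corr-sketch").getOrCreate()
import spark.implicits._

// Toy data: one dense feature vector per row.
val df = Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 8.0)
).map(Tuple1.apply).toDF("features")

// corr takes the DataFrame itself; nothing is collected here.
val corrDf = Correlation.corr(df, "features")

// Only the tiny result (a single Row holding the matrix) reaches the driver.
val Row(coeff: Matrix) = corrDf.head
println(s"Pearson correlation matrix:\n$coeff")
```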
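
On point 2, you can see the shuffle for yourself with explain(). A small sketch with made-up (key, amount) data; look for the "Exchange" node in the printed physical plan, which is the shuffle step:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("shuffle-sketch").getOrCreate()
import spark.implicits._

// Hypothetical (key, amount) records spread across partitions.
val sales = Seq(("a", 10.0), ("b", 5.0), ("a", 7.5)).toDF("key", "amount")

// groupBy must move all rows with the same key to the same worker,
// which shows up in the physical plan as an "Exchange" (the shuffle).
val totals = sales.groupBy("key").agg(sum("amount").as("total"))
totals.explain() // look for "Exchange hashpartitioning(key, ...)"
```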
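
And on point 3, a sketch contrasting the safe and risky cases; the dataset here is a hypothetical one built from a range, just to make the point about result sizes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("collect-sketch").getOrCreate()
import spark.implicits._

// Hypothetical dataset with a million rows.
val events = (1 to 1000000).toDF("id")

// Safe: count() sends back a single Long, not the rows themselves.
val n = events.count()

// Safe: collect() after the data has been reduced to a handful of rows.
val few = events.filter($"id" % 100000 === 0).collect() // 10 rows

// Risky on real data: events.collect() would pull every row into the
// driver's memory and can trigger the OOM failure described above.
```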