Spark - MLlib computeSVD executing on driver

Question

I am running RowMatrix.computeSVD from Scala. In the Spark UI, only one stage, "treeAggregate", appears to run on the cluster. After that, the application master UI shows nothing, while the application keeps executing computeSVD. So I assume that only the "treeAggregate" step runs on the cluster and the rest runs on the driver.

Is there a way to make the whole computeSVD run on the cluster? The driver normally has limited resources, and computeSVD takes a long time for a 9446 x 9446 matrix.

zero323 · Accepted Answer · 2016-08-25T19:42:04

Unfortunately, it looks like changing the computation strategy is not possible without tinkering with the private API.

Spark automatically picks the computation strategy based on the number of columns and k. The fully distributed mode, which makes multiple passes over the data, is used only if both numbers are large and k is relatively high compared to the number of columns.

At first glance, you could trigger distributed computation by keeping k between nCols / 3 and nCols / 2, where nCols is the number of columns.
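As a rough illustration, here is a minimal sketch that takes the heuristic above at face value and picks k from that range before calling the public `RowMatrix.computeSVD(k, computeU)` method. The object name `SvdSketch`, the helper `kBounds`, the toy 200 x 120 random matrix, and the `local[*]` master are all assumptions made for the example, not part of the original setup. Whether the distributed mode is actually selected depends on your Spark version, so check the Spark UI rather than relying on this.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object SvdSketch {
  // Hypothetical helper: bounds suggested by the heuristic above
  // (integer division, so the bounds are rounded down).
  def kBounds(nCols: Int): (Int, Int) = (nCols / 3, nCols / 2)

  def main(args: Array[String]): Unit = {
    // local[*] is only for trying this out; drop it when submitting to a cluster.
    val spark = SparkSession.builder()
      .appName("svd-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy data (assumption): 200 rows x 120 columns of random values.
    val nCols = 120
    val rnd = new scala.util.Random(42)
    val rows = sc.parallelize(
      Seq.fill(200)(Vectors.dense(Array.fill(nCols)(rnd.nextDouble())))
    )
    val mat = new RowMatrix(rows)

    // Pick k in the middle of the suggested range.
    val (lo, hi) = kBounds(mat.numCols().toInt)
    val k = (lo + hi) / 2

    // computeU = false skips building the distributed U factor.
    val svd = mat.computeSVD(k, computeU = false)
    println(s"requested k = $k, singular values returned = ${svd.s.size}")

    spark.stop()
  }
}
```

For the 9446-column matrix from the question, `kBounds(9446)` gives `(3148, 4723)`. Requesting that many singular values is expensive in its own right, so this trades driver load for more distributed passes rather than making the job cheap.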
