Apache Spark Correlation only runs on driver

Question

I am new to Spark and learn that transformations happen on workers and action on the driver but the intermediate action can happen(if the operation is commutative and associative) at the workers also which gives the actual parallelism.

I looked into the correlation and covariance code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

How could I find what part of the correlation has happened at the driver and what at executor?

Update 1: The setup I'm talking about to run the correlation is the cluster setup consisting of multiple VM's. Look here for the images from the SparK web UI: Distributed cross correlation matrix computation

Update 2

I setup my cluster in standalone mode like It was a 3 Node cluster, 1 master/driver(actual machine: workstation) and 2 VM slaves/executor. submitting the job like this ./bin/spark-submit --master spark://192.168.0.11:7077 examples/src/main/python/mllib/correlations_example.py from master node

My correlation sample file is correlations_example.py:

data = sc.parallelize(np.array([range(10000000), range(10000000, 20000000),range(20000000, 30000000)]).transpose()) 
print(Statistics.corr(data, method="pearson")) 
sc.stop()

I always get a sequential timeline as :

Doesn't it mean that it not happening in parallel based on timeline of events ? Am I doing something wrong with the job submission or correlation computation in Spark is not parallel?

Update 3: I tried even adding another executor, still the same seqquential treeAggreagate. I set the spark cluster as mentioned here: http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/

I don't understand your update. So what is the question now ? — eliasah
No, this question is about the Spark implementation of correlation based on looking at the code and finding what happens at the driver and what on the executor. The question I linked is about my experiment. — Roshan Mehta
Let's say that everything happens on the executor in this matter except the returning result which in breeze Matrix. — eliasah

ganeiy ganeiy · Accepted Answer · 2017-06-29T14:15:36

Your statement is not entirely accurate. The container[executor] for the driver is launched on the client/edge node or on the cluster, depending on the spark submit mode e.g. client or yarn. The actions are executed by the workers and the results are sent back to the driver (e.g. collect)

This has been answered already. See link below for more details. When does an action not run on the driver in Apache Spark?

Apache Spark Correlation only runs on driver

1 Answers