I am totally confused about the Spark execution process. I have referred to many articles and tutorials, but none of them discuss this in detail. I might be misunderstanding Spark, so please correct me.
I have a 40 GB file distributed across 4 nodes (10 GB per node) of a 10-node cluster.
When I call spark.read.textFile("test.txt") in my code, will it load the data (40 GB) from all 4 nodes into the driver program (master node)?
Or will this RDD be loaded on all 4 nodes separately? In that case, would each node's part of the RDD hold 10 GB of physical data?
And does the RDD on each node hold its 10 GB of data, with a task run for each partition (i.e., 128 MB in Spark 2.0), and the output finally shuffled back to the driver program (master node)?
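To make the question concrete, here is roughly what I am running (the SparkSession setup and the file path are simplified placeholders; I only print the partition count to understand the behavior):

```scala
import org.apache.spark.sql.SparkSession

object ReadCheck {
  def main(args: Array[String]): Unit = {
    // Simplified setup; the real job runs on the cluster, not locally
    val spark = SparkSession.builder()
      .appName("partition-check")
      .getOrCreate()

    // Dataset[String], one element per line of the 40 GB file
    val lines = spark.read.textFile("test.txt")

    // How many partitions does Spark create for this input?
    println(s"partitions = ${lines.rdd.getNumPartitions}")

    spark.stop()
  }
}
```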
I also read somewhere that "number of cores in the cluster = number of partitions". Does that mean Spark will move the partitions of one node to all 10 nodes for processing?
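Continuing the sketch above, this is what I would compare to check that claim (again just my assumption about how to inspect it, not something from the articles I read):

```scala
// defaultParallelism roughly reflects the total executor cores available
println(s"default parallelism = ${spark.sparkContext.defaultParallelism}")

// partition count of the loaded file, to compare against the core count
println(s"file partitions = ${lines.rdd.getNumPartitions}")
```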