I am new to Spark. Could someone please clear up a doubt for me?
Suppose this is my code:
a = sc.textFile(filename)
b = a.filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
I think this is what happens internally (please correct me if my understanding is wrong):
(1) The variable a is saved as an RDD containing the contents of the text file.
(2) The driver node breaks the work up into tasks, and each task carries information about the split of the data it will operate on. These tasks are then assigned to worker nodes.
(3) When a collect action (i.e. collect() in our case) is invoked, the results are returned to the master from the different nodes and saved in the local variable c.
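
To pin down what I mean by steps (2) and (3), here is a rough plain-Python toy model. It is not Spark code and says nothing about how Spark is actually implemented: make_splits, run_task, collect, and the sample lines are all made-up names and data, and the "workers" are just function calls in one process.

```python
# Toy model only: plain Python, no Spark. All names here are hypothetical.

def keep(line):
    # Same predicate as the filter in the code above.
    return len(line) > 0 and line.split("\t").count("111") > 0

def make_splits(lines, num_splits):
    # "Driver": cut the input into contiguous splits of roughly equal size.
    size = max(1, -(-len(lines) // num_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def run_task(split):
    # "Worker": each task filters only the split it was given.
    return [line for line in split if keep(line)]

def collect(lines, num_splits=2):
    # "Driver": gather every task's result into one local list.
    results = []
    for split in make_splits(lines, num_splits):
        results.extend(run_task(split))
    return results

lines = ["111\tfoo", "", "bar\t222", "baz\t111"]
c = collect(lines)
print(c)
```

Here c ends up as a local list holding only the lines that passed the filter, which is what I expect c to be in my real code.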
Now I want to understand what difference the code below makes:
a = sc.textFile(filename).collect()
b = sc.parallelize(a).filter(lambda x: len(x)>0 and x.split("\t").count("111"))
c = b.collect()
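
One way to frame the comparison is to count how many lines the driver has to receive in each version. The sketch below is a plain-Python toy model, not Spark. It assumes that collect() brings every element of an RDD back to the driver and that parallelize() distributes a local list back out to the workers. The function names and sample lines are made up for illustration.

```python
# Toy model only: plain Python, no Spark. All names here are hypothetical.

def keep(line):
    # Same predicate as the filter in both versions.
    return len(line) > 0 and line.split("\t").count("111") > 0

def lines_received_by_driver_v1(lines):
    # First version: the filter runs on the workers, so only the
    # matching lines come back on the single collect().
    matches = [line for line in lines if keep(line)]
    return len(matches)

def lines_received_by_driver_v2(lines):
    # Second version: the first collect() pulls every line to the driver,
    # parallelize() sends them back out, and the final collect() returns
    # the matching lines.
    pulled = len(lines)
    matches = [line for line in lines if keep(line)]
    return pulled + len(matches)

lines = ["111\tfoo", "", "bar\t222", "baz\t111"]
v1 = lines_received_by_driver_v1(lines)
v2 = lines_received_by_driver_v2(lines)
print(v1, v2)
```

If that assumption holds, the final contents of c would be the same in both versions, and the difference would be that the second one moves the whole file through the driver first.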
Could someone please clarify?