When I write an RDD transformation, e.g.
val rdd = sc.parallelize(1 to 1000)
rdd.map(x => x * 3)
I understand that the closure (x => x * 3), which is simply a Function1, needs to be Serializable, and I think I read somewhere (EDIT: it's implied right there in the documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark) that it is "sent" to the workers for execution (e.g. Akka sending an "executable piece of code" down the wire to the workers to run).
Is that how it works?
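For what it's worth, here is a minimal sketch of the kind of local experiment that makes me lean towards "yes" (assuming a SparkContext named sc, as in spark-shell, and my own hypothetical class names): capturing a non-serializable object in the closure fails at the driver with "Task not serializable", which suggests the closure object itself is being serialized.

// Not Serializable, so a closure capturing an instance of it cannot be serialized.
class NotSerializableMultiplier(val factor: Int)

val bad = new NotSerializableMultiplier(3)
// This fails with org.apache.spark.SparkException: Task not serializable
// (the driver tries to serialize the closure, and `bad` is not serializable).
val failing = sc.parallelize(1 to 1000).map(x => x * bad.factor)

// Case classes are Serializable, so the same closure over this one is fine.
case class SerializableMultiplier(factor: Int)

val good = SerializableMultiplier(3)
val working = sc.parallelize(1 to 1000).map(x => x * good.factor)
working.count()  // runs fine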
Someone at a meetup I attended said that it doesn't actually send any serialized code; since each worker gets a "copy" of the jar anyway, it just needs a reference to which function to run, or something along those lines (though I'm not sure I'm quoting that person correctly).
I'm now utterly confused about how it actually works.
So my questions are:
How are transformation closures sent to workers? Serialized via Akka? Or are they "already there" because Spark sends the entire uber jar to each worker (which sounds unlikely to me...)?
If so, how is the rest of the jar sent to the workers? Is this what the "ClosureCleaner" is doing, i.e. sending only the relevant bytecode to the worker instead of the entire uber jar (only the code the closure depends on)?
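To make that question concrete, here is a hedged sketch (my own hypothetical Pipeline class) of the kind of trimming I imagine the cleaning to be about; the "passing functions to Spark" section of the docs describes this field-capture pitfall:

import org.apache.spark.rdd.RDD

// Hypothetical enclosing class; note it is not Serializable.
class Pipeline {
  val factor = 3

  // `factor` here is really `this.factor`, so the closure captures the whole
  // Pipeline instance and serializing it fails (Task not serializable).
  def tripleCapturingThis(data: RDD[Int]): RDD[Int] =
    data.map(x => x * factor)

  // Copying the field into a local val means only an Int is captured,
  // so the closure stays small and serializable.
  def tripleWithLocalCopy(data: RDD[Int]): RDD[Int] = {
    val localFactor = factor
    data.map(x => x * localFactor)
  }
}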
So to summarise: does Spark, at any point, sync the jars on the --jars classpath with the workers somehow? Or does it send "just the right amount" of code to the workers? And if it does send closures, are they cached in case a recalculation is needed, or is the closure shipped with the task every time a task is scheduled? Sorry if these are silly questions, but I really don't know.
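(By the "--jars classpath" above I mean the jars shipped along with the application, e.g. via spark-submit --jars or, equivalently as far as I understand, something like the following sketch with hypothetical paths.)

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: these jars end up on the executors' classpath, which is
// why I would expect the compiled classes to already be present on the workers.
val conf = new SparkConf()
  .setAppName("closure-question")
  .setMaster("spark://master:7077")  // hypothetical cluster URL
  .setJars(Seq("/path/to/my-app.jar", "/path/to/extra-dependency.jar"))
val sc = new SparkContext(conf)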
Please add sources for your answer if you can; I couldn't find this stated explicitly in the documentation, and I'm too wary to try to conclude it just by reading the code.