I am trying to understand how Apache PySpark works. The video "Spark Python API - Josh Rosen" says that the Python API is a wrapper over the Java API and that internally it invokes Java methods (see around timestamp 6:41):
https://www.youtube.com/watch?v=mJXl7t_k0wE
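From what I can tell, the Python side talks to the JVM through a Py4J gateway, and most PySpark objects just hold a proxy to a corresponding Java object. Here is a small experiment I ran locally to convince myself (local mode; the printed class names are just what I saw on my install and may differ between Spark versions):

from pyspark import SparkContext

sc = SparkContext("local[1]", "py4j-demo")

# sc._jsc is a Py4J proxy for the JVM-side JavaSparkContext;
# every method call on it is forwarded over a local socket to the JVM.
print(type(sc._jsc))                 # <class 'py4j.java_gateway.JavaObject'>
print(sc._jsc.getClass().getName())  # org.apache.spark.api.java.JavaSparkContext
print(sc._jsc.version())             # same as sc.version, but answered by the JVM

sc.stop()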
This documentation says the Java API is a wrapper over the Scala API:
https://cwiki.apache.org/confluence/display/SPARK/Java+API+Internals
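If I am reading that page correctly, the JVM-side JavaRDD is itself a thin wrapper around a Scala RDD, and this seems to be visible even from Python, because the proxy exposes an rdd() method that returns the underlying Scala RDD (again, the concrete class names below are just what I got locally):

from pyspark import SparkContext

sc = SparkContext("local[1]", "java-over-scala-demo")
rdd = sc.parallelize(range(10))

# rdd._jrdd is a proxy for a JVM JavaRDD; JavaRDD.rdd() hands back
# the Scala RDD it delegates to.
print(rdd._jrdd.getClass().getName())        # org.apache.spark.api.java.JavaRDD
print(rdd._jrdd.rdd().getClass().getName())  # e.g. org.apache.spark.rdd.ParallelCollectionRDD

sc.stop()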
I have a few questions, as mentioned below:
1) Does that mean that for each method such as map, reduce, etc. in PySpark, it invokes the corresponding method (say, map) in Java, and the Java code in turn invokes the corresponding method (map) in Scala? So the actual execution happens through the Scala code, and the results come back Scala -> Java -> Python in the reverse order?
2) Also, are the closures/functions used for "map" also sent from Python -> Java -> Scala? (I have put a small serialization experiment after the list.)
3) Given this snippet from the PySpark RDD source:

class RDD(object):

    """
    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
    Represents an immutable, partitioned collection of elements that can be
    operated on in parallel.
    """

    def __init__(self, jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer())):
        self._jrdd = jrdd
        self.is_cached = False
        self.is_checkpointed = False
        self.ctx = ctx
        self._jrdd_deserializer = jrdd_deserializer
        self._id = jrdd.id()
        self.partitioner = None
Does self._jrdd represent the Java-side version of that particular RDD? (There is a small check after the list showing what I mean.)
4) I am using PySpark in IntelliJ and have loaded the source code from https://spark.apache.org/downloads.html.
Is it possible to debug down from PySpark into the Scala API for a function invocation, e.g. the "map" function? When I tried, I could see some Java-related functions being invoked, but after that I could not step any further in IntelliJ's debug mode. (The workaround I am using now is sketched after the list.)
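Regarding question 2, this is what I have pieced together so far: the Python closure never turns into Java or Scala code; it is pickled with the cloudpickle module that ships inside PySpark, the JVM only passes the resulting bytes around, and Python worker processes unpickle and run it. A tiny sketch of just the serialization step (assuming pyspark.cloudpickle is importable, which it is on my install):

import pickle
from pyspark import cloudpickle

f = lambda x: x * 2

payload = cloudpickle.dumps(f)      # the closure becomes an opaque byte string
print(type(payload), len(payload))  # this is all the JVM ever sees of my function

g = pickle.loads(payload)           # a Python worker process does the reverse
print(g(21))                        # 42

Is that the right mental model?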
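Regarding question 3, this is the small check I meant: the id stored in __init__ comes straight from jrdd.id(), so I am assuming the Python RDD and self._jrdd really are the same RDD seen from the two sides (please correct me if that is wrong):

from pyspark import SparkContext

sc = SparkContext("local[1]", "jrdd-demo")
rdd = sc.parallelize([1, 2, 3])

# The Python RDD's id is taken from the wrapped JVM object in __init__,
# so the two ids should always agree.
print(rdd.id(), rdd._jrdd.id())
print(rdd._jrdd.getClass().getName())  # org.apache.spark.api.java.JavaRDD on my setup

sc.stop()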
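And for question 4, the furthest I have got is attaching a second debugger to the JVM rather than stepping across the language boundary in one session: IntelliJ's Python debugger stops at the Py4J proxy, so I start the driver JVM with a JDWP agent and attach a "Remote JVM Debug" run configuration to it (port 5005 and the other agent settings are just my choice). This worked for me when running the script as a plain Python program, where the JVM is only launched when the SparkContext is created; I am not sure it applies when launching through spark-submit:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[1]")
        .setAppName("jvm-debug")
        # Ask the driver JVM to listen for a debugger on port 5005.
        .set("spark.driver.extraJavaOptions",
             "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"))

sc = SparkContext(conf=conf)
# Attach the remote debugger, set breakpoints in the Scala sources,
# then trigger an action such as:
print(sc.parallelize(range(100)).map(lambda x: x + 1).count())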
Any help/explanation/pointers will be appreciated.