
I am trying to understand how Apache PySpark works. The video "Spark Python API - Josh Rosen" says the Python API is a wrapper over the Java API and that, internally, it invokes Java methods (see around the 6:41 mark):

https://www.youtube.com/watch?v=mJXl7t_k0wE

This documentation says the Java API is a wrapper over the Scala API:

https://cwiki.apache.org/confluence/display/SPARK/Java+API+Internals

I have a few questions:

1) Does that mean that for each method such as map, reduce, etc. in PySpark, the corresponding method (say map) is invoked in Java, and the Java code then invokes the corresponding method (map) in Scala? So the actual execution happens in Scala code, and the results are returned in the reverse order again: Scala -> Java -> Python?

2) Are the closures/functions used for "map" also sent from Python -> Java -> Scala?

3) Consider the RDD class from pyspark/rdd.py:

class RDD(object):

    """
    A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
    Represents an immutable, partitioned collection of elements that can be
    operated on in parallel.
    """

    def __init__(self, jrdd, ctx,
                 jrdd_deserializer=AutoBatchedSerializer(PickleSerializer())):
        self._jrdd = jrdd
        self.is_cached = False
        self.is_checkpointed = False
        self.ctx = ctx
        self._jrdd_deserializer = jrdd_deserializer
        self._id = jrdd.id()
        self.partitioner = None

Does self._jrdd represent the Java version of that particular RDD?

4) I am using PySpark in IntelliJ and have loaded the source code from https://spark.apache.org/downloads.html.

Is it possible to debug down from PySpark to the Scala API for a function invocation, e.g. the "map" function? When I tried, I could see some Java-related functions being invoked, but after that I could not step any further in IntelliJ's debug mode.

Any help/explanation/pointers will be appreciated.

1 Answer


So does that mean that for each method such as map, reduce, etc. in PySpark, the corresponding method (say map) is invoked in Java, and the Java code then invokes the corresponding method (map) in Scala?

Yes and no. First of all, Java and Scala compile to the same bytecode, so by the time the code is executed both run in the same context. Python is a different story: with RDDs the internal mechanics are different from the JVM languages. The JVM serves mostly as a transport layer, and the worker-side code is Python. With SQL there is no worker-side Python.
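To make this concrete, here is a minimal sketch (the local SparkContext setup and app name are mine, and _jrdd / PipelinedRDD are private internals that can change between versions) showing what map produces on the Python side and what its JVM counterpart is:

from pyspark import SparkContext

sc = SparkContext("local[1]", "pyspark-internals-demo")

rdd = sc.parallelize(range(4))
mapped = rdd.map(lambda x: x * 2)

# map() does not call a JVM map(); it only records the Python function and
# returns a PipelinedRDD on the Python side.
print(type(mapped))

# The JVM-side counterpart (built lazily through py4j) wraps a PythonRDD,
# which pipes partitions to Python worker processes when an action runs.
print(mapped._jrdd.rdd().getClass().getName())

print(mapped.collect())
sc.stop()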

Also, are the closures/functions used for "map" sent from Python -> Java -> Scala?

Serialized versions are sent via the JVM, but the execution context is Python.
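A quick way to see this in local mode (the function and app name below are just for illustration): the function is pickled on the driver, shipped through the JVM, and executed by separate Python worker processes, which shows up in the process IDs:

import os

from pyspark import SparkContext

sc = SparkContext("local[2]", "closure-demo")

def tag_with_pid(x):
    # Runs inside a Python worker process started by the executor, not in
    # the driver; the function itself was pickled and shipped over the JVM.
    return (os.getpid(), x)

print("driver pid:", os.getpid())
print(sc.parallelize(range(4), 2).map(tag_with_pid).collect())
sc.stop()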

Does self._jrdd represent the Java version of that particular RDD?

Yes, it does.
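As a sketch of what that means in practice (again poking at private internals), _jrdd is a py4j proxy for the JVM-side JavaRDD, so Java methods can be called on it directly from Python; that is exactly what __init__ does with jrdd.id():

from pyspark import SparkContext

sc = SparkContext("local[1]", "jrdd-demo")
rdd = sc.parallelize(["a", "b", "c"])

# _jrdd is a py4j JavaObject proxying the JVM-side JavaRDD, so its Java
# methods can be invoked directly from Python.
print(rdd._jrdd.getClass().getName())
print(rdd._jrdd.id())   # the same id that __init__ stored via jrdd.id()
print(rdd.id())
sc.stop()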

Is it possible to debug down from PySpark to the Scala API for a function invocation, e.g. the "map" function?

See this related question: How can pyspark be called in debug mode?
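If you also want to step into the Scala/Java side, one option (a sketch, not an official recipe) is to start the driver JVM with the standard JDWP agent through spark.driver.extraJavaOptions and attach IntelliJ's "Remote JVM Debug" configuration; in local mode the driver and the executors share that JVM, so breakpoints in the Scala sources can be hit. Depending on how the driver JVM is launched, you may need to pass the same flag on the command line via spark-submit --driver-java-options instead of through SparkConf, since the JVM may already be running when the conf is applied.

from pyspark import SparkConf, SparkContext

# Sketch: expose a JDWP debug port on the driver JVM so IntelliJ can attach.
# Port 5005 is an arbitrary choice; suspend=n lets the application start
# without waiting for the debugger.
conf = (SparkConf()
        .setMaster("local[1]")
        .setAppName("jvm-debug-demo")
        .set("spark.driver.extraJavaOptions",
             "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"))

sc = SparkContext(conf=conf)

# Attach a "Remote JVM Debug" run configuration to localhost:5005, set
# breakpoints in the Scala/Java sources, then trigger an action:
print(sc.parallelize(range(10)).map(lambda x: x + 1).count())
sc.stop()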