
What's the difference between %python and %pyspark in a Zeppelin notebook (screenshot below)?

  • I can run the same Python commands in both cases (like print('hello'))
    • I can use the same PySpark API in both cases
    • i.e. from pyspark.sql import SparkSession, and spark.read.csv
    • EDIT 10/31/2019 This is no longer true; in a %python interpreter I get the message No module named pyspark.
    • I guess I can install the missing module using pip install pyspark, but I don't know how to install it onto a Zeppelin resource.
  • EDIT 10/31/2019 I must use a python interpreter (Python 2), not a python3 interpreter, or else I get an error like: Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

    • Also, I guess this module was installed on Zeppelin when I used it earlier this year.
  • I can even toggle back and forth and use them both in the same notebook
    • i.e. first paragraph uses %python, next paragraph uses %pyspark
    • Never mind; neither language can see the variables defined by the other...
    • They just share the same (Python) API, i.e. each can create its own DataFrame with spark.createDataFrame([...]) (see the sketch after this list)
  • I see from the screenshot below that those languages use different interpreters:
    • %python language -> python interpreter
    • %pyspark language -> spark interpreter
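
For reference, a minimal sketch of what I mean by "the same API" (the sample data is just an illustration I made up; in a %pyspark paragraph the spark variable already exists, while in a %python paragraph I would first have to create a session myself, assuming the pyspark module is importable there):

%pyspark
# spark is already defined in a %pyspark paragraph; the same calls used to
# work for me in a %python paragraph once I had a session there as well.
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
df.show()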

...But what's the difference between using those interpreters, if my API / code is all the same? Is either of them faster/newer/better? Why use one over the other?

[Screenshot: a Zeppelin notebook settings dialog showing three interpreters (spark, python, md); the spark interpreter lists %pyspark among its languages, and the python interpreter lists %python]

%pyspark creates a spark context automatically with the defined parameters (loading spark packages, settings...) for the spark interpreter. In %python you can create a spark context on your own, but it is not done automatically. - cronoik
Yes, you mean %pyspark will implicitly define a spark variable / session for me, whereas with %python I must create the spark variable / session myself, manually? I see that difference saves me some work when using %pyspark, although I wonder what settings it uses to create the session (i.e. appName('...')). Thank you! If you post this as an answer I will accept. - The Red Pea

1 Answer


When you run a %pyspark paragraph, Zeppelin will create a Spark context (the spark variable) automatically, with the defined parameters (loaded Spark packages, settings, ...).* Have a look at the documentation of the spark interpreter for some of the possibilities.

In a %python paragraph you can create a Spark context on your own, but it is not done automatically and it will not use the parameters defined in the spark interpreter section (see the sketch below).
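
For illustration, a minimal sketch of what that manual creation could look like in a %python paragraph (the master and appName values are placeholders, and the pyspark module has to be importable by the python interpreter):

%python
from pyspark.sql import SparkSession

# Nothing is pre-defined here; the session is built by hand and
# the spark interpreter settings are not applied to it.
spark = (SparkSession.builder
         .master('local[4]')           # placeholder master
         .appName('manual-session')    # placeholder application name
         .getOrCreate())
sc = spark.sparkContext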

That may still not seem like much, but Zeppelin can handle multiple users (even if it is currently not perfect), and from an administrative perspective this becomes really handy. An administrator can, for example, define that every Zeppelin user who wants to use Spark (Scala, R or Python) gets the same defined environment (number of executors, memory, software packages of a certain version), along the lines of the illustration below. It is still possible to work around these restrictions, but at least you avoid unintentional configuration differences.
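
As an illustration only (these are standard Spark / Zeppelin spark-interpreter property names, but the values are made up), such a shared environment could be pinned in the interpreter settings roughly like this:

master                    yarn-client
spark.executor.memory     4g
spark.cores.max           8
zeppelin.pyspark.python   python3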

*For example:

%pyspark
sc

Would evaluate to this output:

<SparkContext master=local[4] appName=ZeppelinHub>

and

%pyspark
spark

Would evaluate to this output:

<pyspark.sql.session.SparkSession at 0x7fe757ca1f60>