3 votes

I want to use the spark-csv package from https://github.com/databricks/spark-csv from within the Jupyter service running on a Spark HDInsight cluster on Azure.

On a local cluster I know I can do this with:

export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

However, I don't know where to put this in the Azure Spark configuration. Any clues or hints are appreciated.


4 Answers

2 votes

You can use the %%configure magic to add any required external package. It should be as simple as putting the following snippet in your first code cell.

%%configure
{ "packages":["com.databricks:spark-csv_2.10:1.4.0"] }

This specific example is also covered in the documentation. Just make sure you start the Spark session only after the %%configure cell has run.
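
For example, once the session has started, a later cell in the same PySpark notebook could read a CSV through the package. This is only a hedged sketch: the wasb:/// path is a placeholder, and it assumes the pre-created sqlContext that the HDInsight PySpark kernel exposes.

# Later cell, after the %%configure cell has taken effect and the Spark session has started.
# The wasb:/// path is a placeholder -- point it at a CSV in your cluster's storage container.
df = (sqlContext.read
      .format('com.databricks.spark.csv')          # reader supplied by the spark-csv package
      .options(header='true', inferSchema='true')  # treat the first row as a header, infer column types
      .load('wasb:///example/data/sample.csv'))
df.show(5)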

1 vote

One option for managing Spark packages in a cluster from a Jupyter notebook is Apache Toree. Toree adds some extra line magics that let you manage Spark packages from within the notebook. For example, inside a Jupyter Scala notebook, you would install spark-csv with

%AddDeps com.databricks spark-csv_2.11 1.4.0 --transitive

To install Apache Toree on your Spark clusters, ssh into you Spark clusters and run,

sudo pip install --pre toree
sudo jupyter toree install \
   --spark_home=$SPARK_HOME \
   --interpreters=PySpark,SQL,Scala,SparkR 

I know you specifically asked about Jupyter notebooks running PySpark. At the time of writing, Apache Toree is still an incubating project, and I have run into trouble using its line magics from PySpark notebooks specifically; maybe you will have better luck. I am still looking into why, but personally I prefer Scala in Spark. Hope this helps!

0 votes

You can try executing your two lines of code (the export statements) in a script that you invoke in Azure when the HDInsight cluster is created.

0 votes

Since you are using HDInsight, you can use a "Script Action" that runs when the Spark cluster is provisioned and installs the needed libraries. The script can be a very simple shell script; it is executed automatically on startup and automatically re-executed on new nodes if the cluster is resized.

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/