1 vote

How does one import pyspark in a google-cloud-datalab notebook? Even after setting PYTHONPATH and SPARK_HOME on the node, it doesn't work. Am I missing anything?

    ImportError                               Traceback (most recent call last)
    <ipython-input-4-c15ae3402d12> in <module>()
    ----> 1 import pyspark

    ImportError: No module named pyspark
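For context, the node-side setup mentioned above was along these lines (a sketch only; the Spark install path and py4j version are illustrative, not taken from the question):

    # Sketch of the environment variables described in the question; adjust to your Spark install
    export SPARK_HOME=/usr/lib/spark
    # py4j zip name varies by Spark release
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH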

3 Answers

1 vote

As Fematich said, it's unfortunately not supported yet. However, Datalab is open source; if you feel like it, you could modify the Dockerfile to add pyspark and build your own image. You could also send a pull request if you think that's something other people might be interested in as well.
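As a rough sketch, the change could look like this (the base image tag and a pip-installable pyspark are assumptions; for older Spark releases you would instead copy a Spark distribution into the image and set SPARK_HOME/PYTHONPATH):

    # Build a custom Datalab image with pyspark added (image name is an assumption)
    FROM gcr.io/cloud-datalab/datalab:latest

    # Install pyspark into the image's Python environment
    # (assumes a pip-installable pyspark release fits your use case)
    RUN pip install pyspark

Then build and run it in place of the stock image, e.g. docker build -t my-datalab-pyspark .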

1 vote

You can run Datalab conveniently on Cloud Dataproc via an initialisation action:

https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/datalab

This will allow you to interact with the PySpark environment.
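For instance, creating such a cluster might look like this (the cluster name is a placeholder and the script path follows that repository, so treat both as assumptions):

    # Create a Dataproc cluster that installs Datalab via the initialisation action
    gcloud dataproc clusters create my-cluster \
        --initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh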

Alternatively, you can edit the Datalab Docker image to include Spark (with PySpark). This will allow you to run Datalab with Spark anywhere you wish (locally or on VMs).

0 votes

Datalab doesn't support (py)Spark yet (also check their roadmap). On Google Cloud Platform, the easiest option at the moment is to deploy a Dataproc cluster with a Jupyter notebook; see the documentation here.
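As a sketch, such a cluster could be created along these lines (the Jupyter initialisation-action path is an assumption based on the same dataproc-initialization-actions repository):

    # Create a Dataproc cluster with Jupyter installed via an initialisation action
    gcloud dataproc clusters create my-cluster \
        --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh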

Note that the Dataproc team is also on Stack Overflow, so they will be able to give you more information about the roadmap.