1 vote

I just set up a cluster on Google Cloud Platform to run some PySpark jobs. Initially I used ipython.sh (from the GitHub repository) as the initialization script for the cluster. The cluster started up nicely, but when trying to import pyspark in an IPython notebook, I got a "cannot import name accumulators" error.

After some searching, I suspected this had something to do with the pyspark install path not being included in my PYTHONPATH, so I deleted my cluster and created a new one using jupyter.sh as the initialization script.
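For what it's worth, this is the kind of manual path fix I had in mind for the notebook (the /usr/lib/spark location is just my assumption about the Dataproc image, I haven't verified it):

    # Manually pointing the notebook at Spark's Python libs; /usr/lib/spark is an
    # assumption about where the Dataproc image installs Spark.
    import glob
    import os
    import sys

    spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")
    sys.path.insert(0, os.path.join(spark_home, "python"))

    # py4j ships as a versioned zip under $SPARK_HOME/python/lib, so glob for it.
    py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
    if py4j_zips:
        sys.path.insert(0, py4j_zips[0])

    import pyspark  # previously failed here with "cannot import name accumulators"
    print(pyspark.__version__)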

However, now my cluster won't start up at all; I get an error. The log "dataproc-initialization-script-0_output" simply says:

/usr/bin/env: bash : No such file or directory

Any ideas on what I'm missing here?


Edit:

I got the cluster to start with the public initialization script in gs://dataproc-initialization-actions/jupyter/jupyter.sh

However, I'm still running into the same issue when trying to load pyspark in a PySpark notebook: when I try something like "from pyspark import SparkConf", I get errors (oddly enough, I get a different error if I run the same import a second time).

Any feedback at this stage?

This is the output of my notebook: html notebook output
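In case the exact values matter, this is roughly what I'm running in the notebook; the environment checks are just my own debugging and aren't in the linked output:

    # The failing import, plus some ad-hoc checks of what the kernel actually inherited.
    import os
    import sys

    print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
    print("PYTHONPATH =", os.environ.get("PYTHONPATH"))
    print("spark-ish sys.path entries:",
          [p for p in sys.path if "spark" in p.lower() or "py4j" in p.lower()])

    try:
        from pyspark import SparkConf  # this is the import that fails for me
        print("pyspark import OK:", SparkConf)
    except Exception as e:
        print("pyspark import failed:", repr(e))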

Do you have a cluster_uuid or an operation_id you can share either here or privately with [email protected]? – Dennis Huo
The cluster_uuids are: cluster-1:192c22e4-e0f6-4970-8428-687327016c49 and cluster-1:a1218d27-1b5a-4c7f-97ec-71b34cf76b5f, thank you! – Fematich

1 Answer

2 votes

The most recent Jupyter initialization action for Dataproc was written to target Dataproc --image-version 1.0, so the change of the default version to 1.1 (which includes Spark 2.0.0) appears to have silently broken the PySpark kernel: unfortunately, instead of erroring out during deployment, the kernel simply fails to create the correct Spark environment.
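To make the failure mode concrete, here is a rough, hypothetical sketch of the kind of kernel spec the init action has to produce; it is not the actual jupyter.sh logic, just an illustration of why a py4j path baked in for Spark 1.6 stops resolving once Spark 2.0.0 ships a differently-versioned py4j zip:

    # Hypothetical kernel-spec generator, not the real init action: the point is that
    # the py4j zip is versioned (py4j-0.9-src.zip on Spark 1.6.x, py4j-0.10.x on 2.0.0),
    # so the PYTHONPATH written into kernel.json must be discovered, not hard-coded.
    import glob
    import json
    import os

    def build_pyspark_kernel_spec(spark_home="/usr/lib/spark"):
        py4j_zip = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0]
        return {
            "display_name": "PySpark",
            "language": "python",
            "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
            "env": {
                "SPARK_HOME": spark_home,
                "PYTHONPATH": os.path.join(spark_home, "python") + ":" + py4j_zip,
                "PYSPARK_SUBMIT_ARGS": "pyspark-shell",
            },
        }

    if __name__ == "__main__":
        spec_dir = os.path.expanduser("~/.local/share/jupyter/kernels/pyspark")
        if not os.path.isdir(spec_dir):
            os.makedirs(spec_dir)
        with open(os.path.join(spec_dir, "kernel.json"), "w") as f:
            json.dump(build_pyspark_kernel_spec(), f, indent=2)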

A generous contributor actually sent a pull request a while ago, when Dataproc 1.1 was just about to become the default, but during review the Dataproc team wanted to refactor the script for better future-proofing, rather than keeping separate forks of the kernel configs.

I went ahead and whipped up the refactoring pull request which allows a base kernel generator to work against both Dataproc 1.0 and Dataproc 1.1. As soon as that's merged, new clusters using the standard gs://dataproc-initialization-actions/jupyter/jupyter.sh will automatically start to work correctly. In the meantime, you can do one of two things:

  1. Try reverting to Dataproc 1.0 / Spark 1.6.2:

    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
        --image-version 1.0
    
  2. Try out my updates in place (and thus keep the freshest Dataproc 1.1 + Spark 2.0.0) before they're merged into the upstream master:

    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
        --metadata INIT_ACTIONS_REPO=https://github.com/dennishuo/dataproc-initialization-actions.git,INIT_ACTIONS_BRANCH=dhuo-fix-jupyter-spark2
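
Either way, once the cluster is up you can sanity-check the kernel from a new PySpark notebook with a quick smoke test along these lines (just my suggestion, not part of the init action):

    # Quick smoke test: SparkContext.getOrCreate reuses any context the kernel already made.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext.getOrCreate(SparkConf().setAppName("jupyter-smoke-test"))
    print(sc.version)                              # 1.6.2 on Dataproc 1.0, 2.0.0 on 1.1
    print(sc.parallelize(list(range(100))).sum())  # trivial job to exercise the executors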