I just set up a cluster on Google Cloud Platform to run some pyspark jobs. Initially I used ipython.sh (from the GitHub repository) as the initialization script for the cluster. The cluster started up fine, but when I tried to import pyspark in an IPython notebook, I got a "cannot import name accumulators" error.
After some searching, I suspected this had to do with the pyspark install path not being on my PYTHONPATH, so I deleted my cluster and wanted to create a new one using jupyter.sh as the initialization script.
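If it really is a path problem, this is the kind of workaround I'd try in a notebook cell: a minimal sketch, assuming Dataproc puts Spark under /usr/lib/spark (worth verifying with `echo $SPARK_HOME` on the cluster):

```python
import glob
import os
import sys

# Assumption: Dataproc installs Spark under /usr/lib/spark; adjust if not.
spark_home = os.environ.get("SPARK_HOME", "/usr/lib/spark")

# pyspark itself lives in $SPARK_HOME/python and needs the bundled py4j zip.
sys.path.insert(0, os.path.join(spark_home, "python"))
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])

from pyspark import SparkConf  # this should now resolve
```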
However, now my cluster won't start up at all; I get an error. The log dataproc-initialization-script-0_output simply says:
/usr/bin/env: bash : No such file or directory
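The space in "bash :" makes me wonder whether my copy of the script picked up Windows-style CRLF line endings, so env is looking for a program literally named "bash\r". Here's a quick check/fix I could run on a local copy before re-uploading it to my bucket (jupyter.sh being my local copy):

```python
# Read the local copy of the init script and check for CRLF line endings;
# with a CRLF shebang, env tries to run a program literally named "bash\r".
with open("jupyter.sh", "rb") as f:
    data = f.read()

if b"\r\n" in data:
    with open("jupyter.sh", "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))  # rewrite with Unix line endings
```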
Any ideas on what I'm missing here?
Edit:
I got the cluster to start with the public initialization script at gs://dataproc-initialization-actions/jupyter/jupyter.sh.
However, I'm still running into the same issue when trying to load pyspark in a notebook: when I try something like "from pyspark import SparkConf" I get errors (oddly enough, I get a different error if I run the same import a second time).
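For reference, this is the minimal check I'd expect to succeed once the import is fixed, run after restarting the kernel (a partially failed pyspark import can leave the package half-initialized, which might explain the different error on the second attempt):

```python
from pyspark import SparkConf, SparkContext

# Minimal smoke test: build a context and run a trivial job.
conf = SparkConf().setAppName("import-check")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # expect 45
sc.stop()
```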
Any feedback at this stage?
This is the output of my notebook: html notebook output for cluster-1:192c22e4-e0f6-4970-8428-687327016c49 and cluster-1:a1218d27-1b5a-4c7f-97ec-71b34cf76b5f, thank you! – Fematich