PySpark comes preinstalled on Dataproc -- you should invoke the pyspark command rather than python. For now, trying to pip install pyspark or py4j will break pyspark on Dataproc. You also need to be careful not to pip install any packages that depend on pyspark/py4j. We're aware of this issue :)
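For example, on a cluster node the difference looks roughly like this (a minimal sketch):

```
# Start the Spark-aware Python REPL that Dataproc preinstalls:
pyspark

# Avoid installing these yourself -- it can clobber the preinstalled copies:
# pip install pyspark
# pip install py4j
```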
If you're just trying to switch to Python 3, currently the easiest way to do that is to run the miniconda initialization action when creating your cluster: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/conda/. That init action conveniently also allows you to specify extra pip or conda packages to install.
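As a rough sketch, cluster creation with that init action looks something like the following. The script path and the metadata keys for extra packages are placeholders here -- check the init action's README for the exact names before copying this:

```
# Hypothetical example; verify the script name and metadata keys in the
# conda init action README before using.
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh \
    --metadata 'CONDA_PACKAGES=numpy pandas,PIP_PACKAGES=requests'
```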
We are also aware that pyspark isn't on PYTHONPATH for the python interpreter. For now, if you want to run pyspark code, use the pyspark command. Note that the pyspark command sources /etc/spark/conf/spark-env.sh, which you would have to do manually if you wanted to run import pyspark in a python shell.
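If you really do want import pyspark to work in a plain python shell, the manual workaround is roughly the following (a sketch, assuming spark-env.sh exports the paths PySpark needs, which is what the pyspark wrapper relies on):

```
# Roughly what the pyspark wrapper does before starting Python:
source /etc/spark/conf/spark-env.sh

# After that, a plain python shell should be able to import pyspark:
python -c 'import pyspark; print(pyspark.__version__)'
```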
Side note: rather than SSHing into the cluster and running pyspark, consider running gcloud dataproc jobs submit pyspark (docs) from your workstation, or using a Jupyter notebook.
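For example, from a workstation with the Cloud SDK installed (my_job.py and my-cluster are placeholder names):

```
# Submit a PySpark script to the cluster without SSHing in.
gcloud dataproc jobs submit pyspark my_job.py --cluster=my-cluster
```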