PySpark comes preinstalled on Dataproc -- you should invoke the pyspark command rather than python. For now, trying to pip install pyspark or py4j will break pyspark on Dataproc. You also need to be careful not to pip install any packages that depend on pyspark/py4j. We're aware of this issue :)
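For example, on a cluster node the difference looks roughly like this (a minimal sketch):

```
# Start the Spark-aware Python REPL that Dataproc preinstalls:
pyspark

# Avoid installing these yourself -- it can clobber the preinstalled copies:
# pip install pyspark
# pip install py4j
```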
If you're just trying to switch to Python 3, currently the easiest way to do that is to run the miniconda initialization action when creating your cluster: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/conda/. That init action conveniently also allows you to specify extra pip or conda packages to install.
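As a rough sketch, cluster creation with that init action looks something like the following. The script path and the metadata keys for extra packages are placeholders here -- check the init action's README for the exact names before copying this:

```
# Hypothetical example; verify the script name and metadata keys in the
# conda init action README before using.
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh \
    --metadata 'CONDA_PACKAGES=numpy pandas,PIP_PACKAGES=requests'
```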
We are also aware that pyspark isn't on PYTHONPATH for the python interpreter. For now, if you want to run pyspark code, use the pyspark command. Note that the pyspark command sources /etc/spark/conf/spark-env.sh, which you would have to do manually if you wanted to run import pyspark in a python shell.
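If you really do want import pyspark to work in a plain python shell, the manual workaround is roughly the following (a sketch, assuming spark-env.sh exports the paths PySpark needs, which is what the pyspark wrapper relies on):

```
# Roughly what the pyspark wrapper does before starting Python:
source /etc/spark/conf/spark-env.sh

# After that, a plain python shell should be able to import pyspark:
python -c 'import pyspark; print(pyspark.__version__)'
```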
Side note: rather than SSHing into the cluster and running pyspark, consider running gcloud dataproc jobs submit pyspark (docs) from your workstation, or using a Jupyter notebook.
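For example, from a workstation with the Cloud SDK installed (my_job.py and my-cluster are placeholder names):

```
# Submit a PySpark script to the cluster without SSHing in.
gcloud dataproc jobs submit pyspark my_job.py --cluster=my-cluster
```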