3 votes

I created a Dataproc cluster with Anaconda as an optional component and created a virtual environment in it. Now, when running a PySpark .py file on the master node, I get this error:

Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I need the RDKit package inside the virtual environment, and installing it brings in Python 3.x. I run the following commands on the master node, and the Python version changes:

conda create -n my-venv -c rdkit rdkit=2019.*   
conda activate my-venv
conda install -c conda-forge rdkit

How can I solve this?


1 Answer

1 vote

There are a few things here:

The 1.3 (default) image uses conda with Python 2.7. I recommend switching to 1.4 (--image-version 1.4), which uses conda with Python 3.6.
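As a rough sketch, cluster creation on the 1.4 image could look like this (the cluster name and region are placeholders):

# Create a cluster on the 1.4 image, which ships conda with Python 3.6
gcloud dataproc clusters create my-cluster \
    --image-version 1.4 \
    --optional-components ANACONDA \
    --region us-central1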

If this library is needed on the workers, you can use an initialization action to apply the change consistently to all nodes.
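A minimal sketch, assuming the conda-install.sh action from the Dataproc initialization-actions repository and its CONDA_PACKAGES metadata key (check the repository for the exact script path and supported metadata):

# Install RDKit into the cluster's conda environment on every node at creation time
gcloud dataproc clusters create my-cluster \
    --image-version 1.4 \
    --initialization-actions gs://dataproc-initialization-actions/python/conda-install.sh \
    --metadata 'CONDA_PACKAGES=rdkit'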

PySpark does not currently support virtualenvs, but support is coming. You can currently run a PySpark program from within a virtualenv, but that does not mean the workers will run inside the virtualenv. Is it possible to apply your changes to the base conda environment without a virtualenv?
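If you do install into the base conda environment instead, the driver and executors then need to resolve the same interpreter. A sketch, assuming a 1.4 image with Anaconda (the /opt/conda/anaconda/bin/python path and my_job.py are assumptions; verify the interpreter path on your nodes):

# On each node: install RDKit into the base conda environment, no virtualenv
conda install -y -c conda-forge rdkit

# When submitting, point both driver and executors at the same interpreter
spark-submit \
    --conf spark.pyspark.python=/opt/conda/anaconda/bin/python \
    --conf spark.pyspark.driver.python=/opt/conda/anaconda/bin/python \
    my_job.py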

Additional info can be found here: https://cloud.google.com/dataproc/docs/tutorials/python-configuration