1 vote

I have spun up a Dataproc cluster with Anaconda as the additional component. I have created a virtual env in Anaconda and installed RDKit inside it. Now my issue is that when I open up a Python terminal and try to do this:

from pyspark import SparkContext

It throws an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'

I can install PySpark inside the Anaconda venv and then it works, but I want to use the pre-installed PySpark on Dataproc. How can I resolve this?
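
Roughly what I am doing from an SSH session on the master node (the env name and exact commands are approximate):

conda activate my-env                           # the env where RDKit is installed
which python                                    # points at the env's interpreter
python -c "import sys; print(sys.path)"         # /usr/lib/spark/python is not on this path
python -c "from pyspark import SparkContext"    # fails with the ModuleNotFoundError above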

pip install pyspark with a specific version. Note: please remember to set the paths (JAVA, HADOOP, SPARK) before using it. - Ghost
Dataproc clusters with the Anaconda component have a "base" environment which has pyspark. Is there any reason you do not want to use this env? - cyxxy
I need to install RDKit in an Anaconda venv; it errors out when installing in the "base" env. And now, because of this, some of the PySpark commands are failing in the conda "my-env" where I installed pyspark. - sopana
Can you add more details then? What is the Dataproc image version? What commands did you run to create your own venv, install rdkit, etc.? - cyxxy
I solved this by specifying the version number in "my-env": pip install pyspark==2.3.4, and then the errors stopped. - sopana
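
For reference, the two suggestions in the comments above roughly amount to the following (the exported paths are typical Dataproc defaults and may differ by image):

conda activate my-env
pip install pyspark==2.3.4                            # pin to the cluster's Spark version
# Only relevant for the standalone pip package; adjust to your image
export SPARK_HOME=/usr/lib/spark
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64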

1 Answer

0 votes

To use Dataproc's PySpark in a new Conda environment, you need to install the file:///usr/lib/spark/python package inside this environment:

# Create and activate a Conda env with RDKit from the rdkit channel
conda create -c rdkit -n rdkit-env rdkit
conda activate rdkit-env
# Install Dataproc's bundled PySpark into the env as an editable package
sudo "${CONDA_PREFIX}/bin/pip" install -e "file:///usr/lib/spark/python"