I have spun up a Dataproc cluster with Anaconda as the additional component. I have created a virtual environment in Anaconda and installed RDKit inside it. Now my issue is that when I open a Python shell and try to do this:
from pyspark import SparkContext
It throws this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
I can install PySpark inside the Anaconda venv, and then the import works, but I want to use the PySpark that comes pre-installed on Dataproc. How can I resolve this?
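For reference, the workaround I'm currently leaning toward is pointing the venv's sys.path at the cluster's Spark installation instead of pip-installing PySpark. This is a minimal sketch, assuming Spark sits at the usual Dataproc location of /usr/lib/spark (the py4j zip file name varies by Spark version):

import glob
import os
import sys

# Assumes the Dataproc default; override SPARK_HOME if your image differs.
SPARK_HOME = os.environ.get("SPARK_HOME", "/usr/lib/spark")

# Put the cluster's PySpark sources on this venv's path.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))

# py4j ships inside Spark's python/lib as a versioned zip, e.g. py4j-0.10.7-src.zip.
sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip"))[0])

from pyspark import SparkContext  # should now resolve without pip-installing pyspark
sc = SparkContext.getOrCreate()
print(sc.version)

Is something like this a reasonable approach, or is there a supported way to make the Anaconda venv see Dataproc's PySpark?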