I want to submit a PySpark job to a Dataproc cluster that runs Python 3 by default, and I want the job to use the virtual environment I already have.
I tried two approaches. The first was to zip the entire venv, upload it as an archive, and submit it with the job, but the job was not able to find the dependencies. For example:
gcloud dataproc jobs submit pyspark --project=** --region=** --cluster=** \
--archives gs://**/venv.zip#venv \
--properties spark.pyspark.driver.python=venv/bin/python \
gs://****.main.py
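For reference, this is roughly how I built the archive; the paths below are illustrative, not the exact ones I used. I zipped the contents of the venv directory itself so that bin/python sits at the top level of the archive (which is why the property above points at venv/bin/python once the archive is unpacked into a directory named venv):

# Illustrative packaging steps, not my exact commands
python3 -m venv venv
venv/bin/pip install -r requirements.txt     # my project's dependencies
cd venv && zip -r ../venv.zip . && cd ..     # zip the venv's contents, not its parent folder
gsutil cp venv.zip gs://**/venv.zip          # bucket path redacted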
The second approach was to tell Spark to create a virtual environment for me and install the packages from my requirements file, as described in the link.
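Concretely, the second attempt looked something like the command below. The spark.pyspark.virtualenv.* property names are the ones I picked up from that article; as far as I know they come from a vendor-patched Spark rather than stock Apache Spark, so I'm not even sure Dataproc's Spark recognizes them:

# Sketch of my second attempt; property names taken from the article, bucket paths redacted
gcloud dataproc jobs submit pyspark --project=** --region=** --cluster=** \
  --files gs://**/requirements.txt \
  --properties spark.pyspark.virtualenv.enabled=true,spark.pyspark.virtualenv.type=native,spark.pyspark.virtualenv.requirements=requirements.txt,spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv \
  gs://****.main.py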
But both approaches failed. Can anyone help? Also, I don't want to go the Dataproc initialization-action (startup script) route; I would really like to avoid that.