Initialize virtual environment from requirements.txt while submitting PySpark job to Google Dataproc

Question

I wanted to submit a PySpark job in a Dataproc cluster running Python 3 by default. I wanted to initialize the environment with the virtual env I have.

I tried two ways, One is to zip the entire venv as and upload it as archive and submit it to the cluster. But my job was not able to find the dependencies. e.g

gcloud dataproc jobs submit pyspark --project=** --region=** --cluster=** \
  --archives gs://**/venv.zip#venv \
  --properties spark.pyspark.driver.python=venv/bin/python \
  gs://****.main.py

Second method was that I tried to tell spark to create a virtual env for me and install the requirements from the requirements file provided to me as mentioned in the link

Pyspark with Virtual env

But both the approach failed. Can anyone help? Plus I don't want to go the post initialization script way of Dataproc. I would really want to avoid that.

David Rabinowitz David Rabinowitz · Accepted Answer · 2020-05-12T23:00:11

Would installing the requirements on the cluster help you? Starting from Dataproc image 1.4, you can add the requirements upon cluster creation:

REGION=<region>
gcloud dataproc clusters create my-cluster \ 
    --image-version 1.4 \
    --metadata 'CONDA_PACKAGES=scipy=1.1.0 tensorflow' \ 
    --metadata 'PIP_PACKAGES=pandas==0.23.0 scipy==1.1.0' \ 
    --initialization-actions \ 
    gs://goog-dataproc-initialization-actions-${REGION}/python/conda-install.sh,gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh

You can also install the full Anaconda by adding the following parameter: --optional-components=ANACONDA to the cluster creation

Initialize virtual environment from requirements.txt while submitting PySpark job to Google Dataproc

1 Answers