Following the instructions in this repo (Google Cloud Storage and BigQuery connectors), I used the initialization action below to create a new Dataproc cluster with specific versions of the Google Cloud Storage and BigQuery connectors installed:
gcloud beta dataproc clusters create christos-test \
--region europe-west1 \
--subnet <a subnet zone> \
--optional-components=ANACONDA,JUPYTER \
--enable-component-gateway \
--initialization-actions gs://<bucket-name>/init-scripts/v.0.0.1/connectors.sh \
--metadata gcs-connector-version=1.9.16 \
--metadata bigquery-connector-version=0.13.16 \
--zone europe-west1-b \
--master-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image=<an-image> \
--project=<a-project-id> \
--service-account=composer-dev@vf-eng-ca-nonlive.iam.gserviceaccount.com \
--no-address \
--max-age=5h \
--max-idle=1h \
--labels=<owner>=christos,<team>=group \
--tags=allow-internal-dataproc-dev,allow-ssh-from-management-zone,allow-ssh-from-management-zone2 \
--properties=core:fs.gs.implicit.dir.repair.enable=false
As you can see, I had to put the external dependencies in a bucket of my own, under gs://init-dependencies-big-20824/init-scripts/v.0.0.1/connectors.sh. As per the script's instructions (I am referring to the connectors.sh script), I also had to add the following jars to this bucket (the upload itself is sketched after this list):
- gcs-connector-hadoop2-1.9.16.jar
- gcs-connector-1.7.0-hadoop2.jar
- gcs-connector-1.8.0-hadoop2.jar
- bigquery-connector-hadoop2-0.13.16.jar
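For reference, the upload was just gsutil, roughly along these lines (the local paths are placeholders, and the exact destination prefix inside the bucket is illustrative rather than the point of the question):

gsutil cp \
  gcs-connector-hadoop2-1.9.16.jar \
  gcs-connector-1.7.0-hadoop2.jar \
  gcs-connector-1.8.0-hadoop2.jar \
  bigquery-connector-hadoop2-0.13.16.jar \
  gs://init-dependencies-big-20824/init-scripts/v.0.0.1/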
The script works fine and the cluster is created successfully. However, using a PySpark notebook through Jupyter still results in a BigQuery "class not found" exception. The same happens when I run PySpark directly from the terminal. The only way I was able to avoid that exception was by copying another jar (this time spark-bigquery_2.11-0.8.1-beta-shaded.jar) to my cluster's master node and starting PySpark with:
pyspark --jars spark-bigquery_2.11-0.8.1-beta-shaded.jar
Obviously, this defeats the purpose.
What am I doing wrong? I thought about changing the connectors.sh script to include another copy step that puts spark-bigquery_2.11-0.8.1-beta-shaded.jar under /usr/lib/hadoop/lib, so I tried copying this jar there manually and starting PySpark, but that still didn't work...
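For completeness, the change I was considering is roughly the following, appended to connectors.sh so it runs on every node (a sketch only; the gs:// source path is an assumption on my part, since I would first have to upload that jar to my bucket):

# Sketch: also stage the shaded Spark BigQuery connector next to the other connector jars
gsutil cp \
  gs://init-dependencies-big-20824/init-scripts/v.0.0.1/spark-bigquery_2.11-0.8.1-beta-shaded.jar \
  /usr/lib/hadoop/lib/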