0 votes

I'm trying to connect Google BigQuery on GCP with Spark (using PySpark) without Dataproc, i.e. with a self-hosted Spark cluster. The official Google documentation only covers Dataproc (https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example). Any suggestions? Note: my Spark and Hadoop setup runs on Docker. Thanks.

2 Answers

0 votes

Please have a look at the spark-bigquery-connector project page on GitHub; it details how to reference the GCP credentials from the code.

In short, you should run

spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()

If needed, refer to the Google Cloud documentation on how to create the JSON credentials (service account key) file.
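
To put the pieces together, here is a minimal, hedged PySpark sketch; the connector's Maven coordinates, the key-file path, and the table name are placeholders you would replace with your own values:

from pyspark.sql import SparkSession

# Build a session that pulls the BigQuery connector from Maven at start-up.
# The artifact version is only an example; pick the release matching your
# Spark/Scala version from the spark-bigquery-connector project page.
spark = (
    SparkSession.builder
    .appName("bigquery-example")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
    .getOrCreate()
)

# Read a BigQuery table, authenticating with a service-account JSON key.
# Both the key-file path and the table reference are placeholders.
df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/path/to/key/file.json")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

df.printSchema()
df.show(5)

Running this as a plain Python script works as long as pyspark is installed; the spark.jars.packages setting makes Spark download the connector before the session starts.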

0 votes

The BigQuery connector is publicly available as a jar file (spark-bigquery-connector). You can then either:

  • Add it to the classpath of your on-premise/self-hosted cluster, so that all your applications can reach the BigQuery API.
  • Add the connector only to your Spark application, for example with the --jars option of spark-submit. There are other ways to do this, each with its own implications for your app; to know more, please check Add jars to a Spark Job - spark-submit.

Once the jar is on the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz above.
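
As a rough illustration of the second option (attaching the connector only to your application on a self-hosted/Docker cluster), a sketch along these lines could work; the jar path and Maven coordinates are assumptions that depend on your Spark and Scala versions:

from pyspark.sql import SparkSession

# Point the session at a connector jar already present in the Docker image.
# The path is a placeholder; download the spark-bigquery-connector release
# that matches your Scala/Spark version and copy or mount it into the container.
spark = (
    SparkSession.builder
    .appName("bigquery-on-prem")
    .config("spark.jars",
            "/opt/spark/extra-jars/spark-bigquery-with-dependencies_2.12.jar")
    .getOrCreate()
)

# Alternatives (example paths and coordinates):
#   spark-submit --jars /opt/spark/extra-jars/spark-bigquery-with-dependencies_2.12.jar app.py
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1 app.py
# or drop the jar into $SPARK_HOME/jars/ to put it on the cluster-wide classpath.

Once the connector is on the classpath by any of these routes, the spark.read.format("bigquery") snippet from the first answer should work unchanged.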