0 votes

I'm trying to connect Google BigQuery on GCP with Spark (using PySpark) without Dataproc, i.e. with a self-hosted Spark cluster. The official Google documentation only covers Dataproc (https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example). Any suggestions? Note: my Spark and Hadoop setup runs on Docker. Thanks.

2 Answers

0 votes

Please have a look at the spark-bigquery-connector project page on GitHub; it details how to reference the GCP credentials from the code.

In short, you should run

spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()

If needed, refer to the Google Cloud documentation on how to create the JSON credentials (service account key) file.
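
To put the pieces together, here is a minimal, hedged PySpark sketch; the connector's Maven coordinates, the key-file path, and the table name are placeholders you would replace with your own values:

from pyspark.sql import SparkSession

# Build a session that pulls the BigQuery connector from Maven at start-up.
# The artifact version is only an example; pick the release matching your
# Spark/Scala version from the spark-bigquery-connector project page.
spark = (
    SparkSession.builder
    .appName("bigquery-example")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
    .getOrCreate()
)

# Read a BigQuery table, authenticating with a service-account JSON key.
# Both the key-file path and the table reference are placeholders.
df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/path/to/key/file.json")
    .option("table", "my-project.my_dataset.my_table")
    .load()
)

df.printSchema()
df.show(5)

Running this as a plain Python script works as long as pyspark is installed; the spark.jars.packages setting makes Spark download the connector before the session starts.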

0 votes

The BigQuery connector is publicly available as a jar file (spark-bigquery-connector). You can then either:

  • Add it to the classpath of your on-premise/self-hosted cluster, so that all your applications can reach the BigQuery API.
  • Add the connector only to your Spark application, for example with the --jars option of spark-submit. There are other ways to do this, each with its own implications for your app; to know more, please check Add jars to a Spark Job - spark-submit.

Once the jar is on the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz above.
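
As a rough illustration of the second option (attaching the connector only to your application on a self-hosted/Docker cluster), a sketch along these lines could work; the jar path and Maven coordinates are assumptions that depend on your Spark and Scala versions:

from pyspark.sql import SparkSession

# Point the session at a connector jar already present in the Docker image.
# The path is a placeholder; download the spark-bigquery-connector release
# that matches your Scala/Spark version and copy or mount it into the container.
spark = (
    SparkSession.builder
    .appName("bigquery-on-prem")
    .config("spark.jars",
            "/opt/spark/extra-jars/spark-bigquery-with-dependencies_2.12.jar")
    .getOrCreate()
)

# Alternatives (example paths and coordinates):
#   spark-submit --jars /opt/spark/extra-jars/spark-bigquery-with-dependencies_2.12.jar app.py
#   spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1 app.py
# or drop the jar into $SPARK_HOME/jars/ to put it on the cluster-wide classpath.

Once the connector is on the classpath by any of these routes, the spark.read.format("bigquery") snippet from the first answer should work unchanged.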