I'm trying to load a BigQuery table into my program using Spark (Scala), but I'm having trouble understanding the role of 'buckets' in BigQuery.
I followed the examples at https://github.com/samelamin/spark-bigquery and at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example, changing the projectId to my own, and I've downloaded a service account .json file for authentication.
Here's my code
import com.samelamin.spark.bigquery._
import org.apache.spark.sql.{DataFrame, SparkSession}

class SparkSessionFunctions(val spark: SparkSession) {
  def loadBQTable: DataFrame = {
    val sqlContext = spark.sqlContext

    // Connector settings: GCS bucket, billing project, and service account key
    sqlContext.setBigQueryGcsBucket("bucketname") // What's this for?
    sqlContext.setBigQueryProjectId("data-staging-5c4d")
    sqlContext.setGcpJsonKeyFile("/key.json")
    sqlContext.hadoopConf.set("fs.gs.project.id", "data-staging-5c4d")

    // Read the table via tableReferenceSource (project:dataset.table)
    val df = spark.sqlContext.read
      .format("com.samelamin.spark.bigquery")
      .option("tableReferenceSource", "data-staging-5c4d:data_warehouse.table_to_load")
      .load()

    println("df: " + df.select("id").collect().mkString(", "))
    df
  }
}
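For completeness, this is roughly how I call it. The SparkSession setup below is just a minimal local sketch for testing; the app name and master are placeholders, not my real config.

import org.apache.spark.sql.SparkSession

object LoadTableApp {
  def main(args: Array[String]): Unit = {
    // Minimal local session just to exercise loadBQTable; settings are placeholders
    val spark = SparkSession.builder()
      .appName("bq-load-test")
      .master("local[*]")
      .getOrCreate()

    val df = new SparkSessionFunctions(spark).loadBQTable
    println(df) // prints the DataFrame's schema summary

    spark.stop()
  }
}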
Running println(df) shows my table's schema, but I'm not able to collect anything from the table itself because of an error saying my service account does not have storage.objects.get access to bucket bucketname/hadoop/tmp/bigquery/job_20190626140444_0000.
To my understanding, buckets are only used in GCS and are not used in BigQuery at all. So why do both libraries need a bucket value specified for them to work?
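For reference, the Dataproc tutorial I linked wires the bucket in through the Hadoop configuration rather than through the sqlContext helpers. Below is a rough sketch of what it does, adapted to my own project, bucket, and table names, so treat the exact details as approximate rather than a verified snippet.

import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.SparkSession

object HadoopConnectorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-hadoop-connector").getOrCreate()
    val sc = spark.sparkContext
    val conf = sc.hadoopConfiguration

    // Billing project and the GCS bucket used to stage the exported table
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "data-staging-5c4d")
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "bucketname") // same role as setBigQueryGcsBucket above

    // Table to read, in project:dataset.table form
    BigQueryConfiguration.configureBigQueryInput(conf, "data-staging-5c4d:data_warehouse.table_to_load")

    // The connector exports the table into the bucket, then Spark reads the exported files
    val tableData = sc.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])

    println(tableData.count())
  }
}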
Comment from Yogesh: read Spark data -> write to GCS bucket -> write to staging dataset -> finally write to the original dataset.