
I'm trying to load a BigQuery table into my program using Spark and Scala, but I'm having trouble understanding the role of 'buckets' in BigQuery.

I followed the examples at https://github.com/samelamin/spark-bigquery and at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example: I changed the projectId to my own and downloaded a service account .json file for authentication.

Here's my code

import com.samelamin.spark.bigquery._
import org.apache.spark.sql.{DataFrame, SparkSession}

class SparkSessionFunctions(val spark: SparkSession) {
  def loadBQTable[T]: DataFrame = {
    val sqlContext = spark.sqlContext
    sqlContext.setBigQueryGcsBucket("bucketname") // What's this for?
    sqlContext.setBigQueryProjectId("data-staging-5c4d")
    sqlContext.setGcpJsonKeyFile("/key.json")
    sqlContext.hadoopConf.set("fs.gs.project.id", "data-staging-5c4d")

    val df = spark.sqlContext.read
      .format("com.samelamin.spark.bigquery")
      .option("tableReferenceSource", "data-staging-5c4d:data_warehouse.table_to_load")
      .load()
    println("df: " + df.select("id").collect())
    df
  }
}

Running println(df) shows my table schema, but I'm not able to collect anything from the table itself due to an error that says my service account does not have storage.objects.get access to bucket bucketname/hadoop/tmp/bigquery/job_20190626140444_0000.

To my understanding, buckets are only used in GCS and are not used in BigQuery at all. So why do both libraries need a bucket value specified in order to work?

You are correct: BigQuery doesn't have buckets; they are a feature of Cloud Storage only. A bucket is a common mid-point for data flowing into or out of BigQuery. - Ben P
BigQuery Spark connectors like this one read the Spark data -> write it to a GCS bucket -> write it to a staging dataset -> finally write it to the original dataset. - Yogesh

1 Answer


In this case, the bucket has nothing to do with BigQuery itself; it belongs to Google Cloud Storage. The Spark connector first transfers the data to GCS as a staging area (which is why the bucket is required), and then moves it into BigQuery.
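
For illustration, here is roughly what that staging flow looks like if you do it by hand with plain Spark and the google-cloud-bigquery Java client. This is only a sketch of the mechanism, not the connector's actual code, and the bucket name my-staging-bucket is made up:

import com.google.cloud.bigquery.{BigQueryOptions, ExtractJobConfiguration, JobInfo, TableId}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-staging-sketch").getOrCreate()
val bigquery = BigQueryOptions.getDefaultInstance.getService

// 1. Export the BigQuery table into a GCS staging bucket. Both this step and the
//    read below need your service account to have object access on the bucket.
val source = TableId.of("data_warehouse", "table_to_load")
val stagingUri = "gs://my-staging-bucket/tmp/table_to_load/*.csv" // hypothetical bucket
bigquery.create(JobInfo.of(ExtractJobConfiguration.of(source, stagingUri))).waitFor()

// 2. Read the staged files back into Spark from GCS.
val df = spark.read.option("header", "true").csv("gs://my-staging-bucket/tmp/table_to_load/*.csv")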

A number of connectors work this way because BigQuery can query a CSV in Cloud Storage directly via what it calls an External Data Source. This lets you treat a file in Cloud Storage as a table and query it with BigQuery's compute, including the option to auto-detect the schema, and it tends to be a faster way to ELT/ETL data than more complex integrations with the BigQuery API.
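
For a rough sketch of what defining an External Data Source looks like, here is an example with the google-cloud-bigquery Java client (neither Spark connector is involved here, and the bucket, file, dataset, and column names are purely illustrative):

import com.google.cloud.bigquery._

val bigquery = BigQueryOptions.getDefaultInstance.getService

// Declare a CSV file sitting in Cloud Storage as an external table.
// BigQuery scans the file at query time instead of ingesting it.
val schema = Schema.of(
  Field.of("id", StandardSQLTypeName.INT64),
  Field.of("name", StandardSQLTypeName.STRING)
)
val externalDef = ExternalTableDefinition.of("gs://my-staging-bucket/exports/users.csv", schema, FormatOptions.csv())
bigquery.create(TableInfo.of(TableId.of("data_warehouse", "users_external"), externalDef))

// The users_external table can now be queried like any other BigQuery table.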

To correct the cause of the error you're seeing, you'll want to assign your service account the appropriate permission role in the console under IAM & Admin (most likely Storage Object Viewer).

Update: You may also want to check permissions on the bucket as well as on the object (file) itself, since these can override the default IAM role permissions. You can do this from the Permissions tab at the top of the Cloud Storage browser for the bucket, and from the kebab (three-dot) menu next to individual files by selecting the "Edit permissions" option. Note that these object-level permissions aren't actually part of IAM but of Access Control Lists (ACLs), so they can be a little confusing at first.

One more thing I missed previously: an error like this typically contains only the bucket name, but yours shows a full path. I would make sure that the call to sqlContext.setBigQueryGcsBucket() contains only the quoted bucket name, not the path to the file.
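
In other words, keep it to something like this (bucketname here stands for your actual bucket name):

sqlContext.setBigQueryGcsBucket("bucketname")
// not: sqlContext.setBigQueryGcsBucket("bucketname/hadoop/tmp/bigquery/...")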

Once you get past the permissions issue on the file, you may need to add a call to sqlContext.setBigQueryDatasetLocation("[your region code here]"), where the region code is the correct Asia Pacific region from this list (note: Asia Pacific is a bit different; most tools use the multi-region strings "US" or "EU", but they will accept the longer single-region names as well).
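
For example, if your dataset happened to live in the Tokyo region (asia-northeast1 is just an illustration; substitute your dataset's actual location):

sqlContext.setBigQueryDatasetLocation("asia-northeast1")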