I am reading data from BigQuery into a Dataproc Spark cluster. If the data in my BigQuery table was originally loaded from GCS, is it better to read the data from GCS directly into the Spark cluster, since the BigQuery connector for Dataproc (newAPIHadoopRDD) exports the data to a Google Cloud Storage bucket first anyway? What are the pros and cons of these two methods?
1 Answer
Using the BigQuery connector is best for cases where you want to abstract away the GCS export/import as much as possible, and don't want to explicitly manage datasets inside of GCS.
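For reference, here is a minimal PySpark sketch of reading a table through the BigQuery connector with newAPIHadoopRDD; the project, dataset, table, and staging-bucket names are placeholders, and the temp GCS path is where the connector stages its export before Spark reads it:

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-connector-read").getOrCreate()
    sc = spark.sparkContext

    # Connector configuration; all IDs and bucket names below are placeholders.
    conf = {
        "mapred.bq.project.id": "my-project",
        "mapred.bq.gcs.bucket": "my-staging-bucket",
        "mapred.bq.temp.gcs.path": "gs://my-staging-bucket/bq_export",  # export lands here first
        "mapred.bq.input.project.id": "my-project",
        "mapred.bq.input.dataset.id": "my_dataset",
        "mapred.bq.input.table.id": "my_table",
    }

    # Each record arrives as (row id, JSON string); parse the JSON payload.
    table_rdd = sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf)
    rows = table_rdd.map(lambda kv: json.loads(kv[1]))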
If you already have the dataset in GCS, it's likely better to read it from GCS directly: you avoid the extra export step and can use simpler filesystem interfaces. The downside is that it's more costly to maintain two copies of your dataset (one in GCS and one in BigQuery) and keep them in sync. But if the size isn't prohibitive and the data isn't updated too frequently, you might find it easiest to keep the GCS dataset around for direct access.
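For comparison, reading the same data straight from GCS is just a normal Spark read; the path and file format here are assumptions about how your dataset was originally loaded (e.g. newline-delimited JSON):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-direct-read").getOrCreate()

    # Read the original files directly; no BigQuery export step involved.
    df = spark.read.json("gs://my-bucket/my-dataset/*.json")
    df.printSchema()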