0
votes

We have large volumes (10 to 400 billion) of raw data in BigQuery tables. We need to process this data into star schema tables (probably in a different BigQuery dataset) that can then be accessed by AtScale.

We need the pros and cons of the two options below:
1. Write complex SQL within BigQuery that reads data from the source dataset and loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connectors from Dataproc, and then load the data into the BigQuery target dataset.

The complexity of our transformations involves joining multiple tables at different granularities, using analytic functions to get the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite this processing logic in GCP (BigQuery or Dataproc).
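For illustration, a simplified sketch of what option 1 might look like for us is below; all project, dataset, table, and column names are made up, and the temp table stands in for the Vertica temp tables we use today.

```python
# Hypothetical sketch of option 1: a multi-statement BigQuery script that
# mirrors the Vertica temp-table pattern. Names and columns are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id

script = """
-- Stage an intermediate result, similar to a Vertica temp table
CREATE TEMP TABLE daily_rollup AS
SELECT
  customer_id,
  DATE(event_ts) AS event_date,
  SUM(amount) AS daily_amount
FROM `my-project.raw_dataset.events`
GROUP BY customer_id, event_date;

-- Join at a different granularity, apply an analytic function,
-- and materialize the star-schema fact table in the target dataset
CREATE OR REPLACE TABLE `my-project.star_dataset.fact_customer_daily` AS
SELECT
  r.customer_id,
  r.event_date,
  r.daily_amount,
  c.segment,
  SUM(r.daily_amount) OVER (
    PARTITION BY r.customer_id
    ORDER BY r.event_date
  ) AS running_amount
FROM daily_rollup AS r
JOIN `my-project.raw_dataset.customers` AS c
  ON c.customer_id = r.customer_id;
"""

# A multi-statement script runs as a single query job
client.query(script).result()
```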

2
How frequently will this process be executed? - rmesteves
We are looking at executing them on a daily basis as part of ETL loads or even multiple times a day. - ppeddi

2 Answers

2
votes

I went successfully with option 1: BigQuery is very capable of running very complex transformations with SQL, and on top of that you can also run them incrementally with time range decorators. Note that it takes a lot of time and resources to move data out of and back into BigQuery. When running BigQuery SQL, the data never leaves BigQuery in the first place, and you already have all your raw logs there. So as long as your problem can be solved by a series of SQL statements, I believe this is the best way to go.
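As a rough illustration of the incremental idea (a date-filtered daily run rather than the legacy decorators specifically), something like the following works; all project, dataset, table, and column names here are placeholders.

```python
# Hypothetical incremental daily run: process one day of raw data and
# append it to the target star-schema table. Names are placeholders.
from google.cloud import bigquery
import datetime

client = bigquery.Client()
run_date = datetime.date.today() - datetime.timedelta(days=1)

sql = """
SELECT
  customer_id,
  DATE(event_ts) AS event_date,
  SUM(amount) AS daily_amount
FROM `my-project.raw_dataset.events`
WHERE DATE(event_ts) = @run_date  -- restrict the run to a single day of raw data
GROUP BY customer_id, event_date
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my-project.star_dataset.fact_customer_daily"),
    write_disposition="WRITE_APPEND",
    query_parameters=[bigquery.ScalarQueryParameter("run_date", "DATE", run_date)],
)
client.query(sql, job_config=job_config).result()
```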

We moved off our Vertica reporting cluster last year, successfully rewriting the ETL with option 1.

Around a year ago, I wrote a POC comparing Dataflow and a series of BigQuery SQL jobs orchestrated by a potens.io workflow, which allows SQL parallelization at scale.

It took me a good month to write the Dataflow pipeline in Java, with 200+ data points and complex transformations, and with terrible debugging capabilities at the time. It took a week to do the same using a series of SQL jobs with potens.io, utilizing Cloud Functions for windowed tables and parallelization with clustered transient tables. I know there have been a bunch of improvements in Cloud Dataflow since then, but at the time Dataflow did fine only at the millions scale and never completed on inputs of billions of records (mainly because the shuffle cardinality ran to just under billions of records, with each record having 200+ columns). The SQL approach produced all the required aggregations in under 2 hours for a dozen billion. Ease of debugging and troubleshooting with potens.io helped a lot too.

0
votes

Both BigQuery and Dataproc can handle huge amounts of complex data.

I think that you should consider two points:

  1. Which transformations would you like to perform on your data?

    Both tools can perform complex transformations, but you have to consider that PySpark gives you the processing capabilities of a full programming language, while BigQuery gives you SQL transformations and some scripting structures. If SQL and simple scripting structures alone can handle your problem, BigQuery is an option. If you need complex scripts to transform your data, or if you think you'll need to build extra transformation features in the future, PySpark may be a better option (see the PySpark sketch after this list). You can find the BigQuery scripting reference here

  2. Pricing

    BigQuery and Dataproc have different pricing models. While in BigQuery you mainly need to think about how much data your queries will process, in Dataproc you have to think about your cluster's size and VM configuration, how long your cluster will be running, and some other settings. You can find the pricing reference for BigQuery here and for Dataproc here. You can also simulate the pricing in the Google Cloud Platform Pricing Calculator.
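As a rough sketch of the PySpark side, assuming the spark-bigquery-connector is available on your Dataproc cluster and you have a GCS bucket for staging writes (all project, dataset, table, column, and bucket names below are placeholders):

```python
# Hypothetical PySpark job for option 2, using the spark-bigquery-connector
# on Dataproc. Dataset/table/bucket names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("star-schema-build").getOrCreate()

# Read raw tables from BigQuery
events = spark.read.format("bigquery") \
    .option("table", "my-project.raw_dataset.events").load()
customers = spark.read.format("bigquery") \
    .option("table", "my-project.raw_dataset.customers").load()

# Aggregate to daily granularity, join, and apply a window (analytic) function
daily = (events
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("customer_id", "event_date")
         .agg(F.sum("amount").alias("daily_amount")))

w = Window.partitionBy("customer_id").orderBy("event_date")
fact = (daily.join(customers, "customer_id")
        .withColumn("running_amount", F.sum("daily_amount").over(w)))

# Write the result back to the target BigQuery dataset
fact.write.format("bigquery") \
    .option("table", "my-project.star_dataset.fact_customer_daily") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .mode("overwrite") \
    .save()
```

Note that writes through the connector are staged in GCS before being loaded into BigQuery, which is part of the "moving data back and forth" overhead mentioned in the other answer.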

I suggest that you create a simple POC of your project in both tools to see which one has the best cost-benefit for you.

I hope this information helps you.