We have large volumes of raw data (10 to 400 billion) in BigQuery tables. We need to transform this data into star schema tables (probably in a different BigQuery dataset) that can then be accessed by AtScale.
I need the pros and cons of the two options below:
1. Write complex SQL within BigQuery that reads data from the source dataset and loads it into the target dataset (used by AtScale).
2. Use PySpark or MapReduce with the BigQuery connector on Dataproc, and then load the data into the BigQuery target dataset. (Rough sketches of both options follow this list.)
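
For context, here is a minimal sketch of what option 1 might look like when driven from the BigQuery Python client: the transformation runs entirely inside BigQuery and the result lands directly in the target dataset. The project, dataset, table, and column names are placeholders, not our actual schema.

```
# Option 1 sketch: ELT entirely inside BigQuery, driven by the Python client.
# All project/dataset/table/column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# CREATE OR REPLACE TABLE ... AS SELECT keeps the transformation as a single
# BigQuery job; joins and window functions run inside BigQuery itself.
sql = """
CREATE OR REPLACE TABLE `my-project.star_schema.fact_sales` AS
SELECT
  s.sale_id,
  s.customer_id,
  s.product_id,
  s.sale_ts,
  s.amount,
  SUM(s.amount) OVER (PARTITION BY s.customer_id ORDER BY s.sale_ts) AS running_amount
FROM `my-project.raw.sales` AS s
JOIN `my-project.raw.customers` AS c USING (customer_id)
"""

client.query(sql).result()  # blocks until the query job finishes
```

And a sketch of option 2: reading the raw tables from BigQuery with the spark-bigquery-connector on Dataproc, transforming in PySpark, and writing back to the target dataset. Again, the table names and the GCS staging bucket are assumptions for illustration.

```
# Option 2 sketch: Dataproc + PySpark with the spark-bigquery-connector.
# Table names and the GCS staging bucket are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-star-schema").getOrCreate()

# Read the raw tables straight out of BigQuery.
sales = spark.read.format("bigquery").option("table", "my-project.raw.sales").load()
customers = spark.read.format("bigquery").option("table", "my-project.raw.customers").load()

# Joins, aggregations, and window functions run on the Dataproc cluster.
fact_sales = sales.join(customers, "customer_id")

# Write the result into the star schema dataset; with this option the
# connector stages files in GCS before loading them into BigQuery.
(fact_sales.write.format("bigquery")
    .option("table", "my-project.star_schema.fact_sales")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .mode("overwrite")
    .save())
```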
The complexity of our transformations involves joining multiple tables at different granularities, using analytic functions to derive the required information, etc.
Presently this logic is implemented in Vertica using multiple temp tables for faster processing, and we want to rewrite it on GCP (BigQuery or Dataproc).
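
To make the kind of logic concrete, here is a rough PySpark sketch of how one Vertica-style temp-table step might translate: an intermediate, cached DataFrame stands in for the temp table, and an analytic (window) function is applied after a join across granularities. All table and column names are made up for illustration.

```
# Rough sketch of staged, temp-table-style processing in PySpark.
# Inputs and columns are hypothetical, not our real schema.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("staged-transformations").getOrCreate()

# Hypothetical inputs: order lines at line-item grain, customers at customer grain.
order_lines = spark.read.format("bigquery").option("table", "my-project.raw.order_lines").load()
customers = spark.read.format("bigquery").option("table", "my-project.raw.customers").load()

# Stage 1: the rough equivalent of a Vertica temp table -- an intermediate
# DataFrame cached so later stages do not recompute it.
orders = (order_lines
    .groupBy("order_id", "customer_id")
    .agg(F.sum("line_amount").alias("order_amount"))
    .cache())

# Stage 2: join to the customer dimension and apply an analytic function,
# e.g. rank each customer's orders by amount.
w = Window.partitionBy("customer_id").orderBy(F.desc("order_amount"))
ranked = (orders
    .join(customers, "customer_id")
    .withColumn("order_rank", F.row_number().over(w)))
```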