
I have some questions related to Cloud Composer and BigQuery. We need to build an automated process that exports tables from BigQuery to Cloud Storage. I have 4 options at the moment:

  • bigquery_to_gcs operator (see the sketch after this list).
  • BashOperator: run the "bq" command that the Cloud SDK provides on Cloud Composer.
  • Python function: write a function that uses the BigQuery API (doing almost the same thing as bigquery_to_gcs) and execute it from Airflow.
  • Dataflow: the job would also be orchestrated by Airflow.

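For reference, here is a minimal sketch of what the first option could look like as a DAG task. It assumes a recent Airflow 2.x image with the Google provider installed (on older Composer images the operator is BigQueryToCloudStorageOperator under airflow.contrib.operators.bigquery_to_gcs); the project, dataset, table, and bucket names are placeholders.

```python
# Sketch only: assumes Airflow 2.x with the Google provider installed.
# All project/dataset/table/bucket names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import (
    BigQueryToGCSOperator,
)

with DAG(
    dag_id="bq_table_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    export_table = BigQueryToGCSOperator(
        task_id="export_table_to_gcs",
        source_project_dataset_table="my-project.my_dataset.my_table",
        # The wildcard lets BigQuery shard the output into multiple files,
        # which it requires for tables larger than 1 GB.
        destination_cloud_storage_uris=["gs://my-bucket/exports/my_table-*.json"],
        export_format="NEWLINE_DELIMITED_JSON",
    )
```
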
I have some doubts about the first 3 options, though. If the table is huge, is there a chance the export will consume a big part of Cloud Composer's resources? I've been trying to find out whether the BashOperator and the BigQuery operator consume Cloud Composer resources, keeping in mind that this process will eventually run in production with more DAGs running at the same time. If that's the case, would Dataflow be the more convenient option?

An advantage of Dataflow is that it can export the table to a single file if we want; that's not possible with the other options when the table is larger than 1 GB.
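To illustrate that 1 GB point with the "Python function" option: when the export runs through BigQuery itself, the destination URI for a table over 1 GB needs a wildcard so BigQuery can shard the output across multiple files. A rough sketch using the google-cloud-bigquery client library, with placeholder names:

```python
# Sketch of the "Python function" option using the google-cloud-bigquery client.
# Names are placeholders; the export itself runs as a job inside BigQuery.
from google.cloud import bigquery


def export_table_to_gcs() -> None:
    client = bigquery.Client()

    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

    # The "*" wildcard is what allows (and is required for) exports larger
    # than 1 GB: BigQuery replaces it with a shard number per output file.
    extract_job = client.extract_table(
        "my-project.my_dataset.my_table",
        ["gs://my-bucket/exports/my_table-*.json"],
        job_config=job_config,
    )
    extract_job.result()  # Blocks until the BigQuery-managed job finishes.
```
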

Agree with Pablo's answer below, but in case performance matters, Dataflow could be the right option, since the BigQuery export paths (using bq or the API) use shared slots by default, and hence throughput is not guaranteed. – Akhil Baby

1 Answer


BigQuery itself has a feature to export data to GCS. This means that if you use any of the options you mention (except for the Dataflow job), you will simply trigger an export job that is performed and managed by BigQuery.

This means that you do not need to worry about consuming cluster resources in Composer: the bigquery_to_gcs operator is simply a controller instructing BigQuery to run an export.

So, of the options you mention, the bigquery_to_gcs operator, the BashOperator, and a Python function will all incur a similarly low cost on Composer. Just use whichever you find easier to manage.
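
For example, the BashOperator variant is just a thin wrapper around bq extract, which submits the same kind of BigQuery-managed export job; this is only a sketch, and the DAG, table, and bucket names are placeholders:

```python
# Sketch of the BashOperator option: "bq extract" submits an export job that
# BigQuery runs and manages; the Composer worker only waits for the CLI call.
# DAG, table, and bucket names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bq_cli_export",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    export_with_bq = BashOperator(
        task_id="export_table_with_bq",
        bash_command=(
            "bq extract --destination_format=NEWLINE_DELIMITED_JSON "
            "'my-project:my_dataset.my_table' "
            "'gs://my-bucket/exports/my_table-*.json'"
        ),
    )
```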