I have set up an Airflow workflow that ingests some files from S3 to Google Cloud Storage and then runs a series of SQL queries to create new tables in BigQuery. At the end of the workflow I need to push the output of one final BigQuery table to Google Cloud Storage, and from there to S3.
I have cracked the transfer of the BigQuery table to Google Cloud Storage with no issues, using the BigQueryToCloudStorageOperator Python operator.
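For reference, that working step in my DAG looks roughly like the sketch below (the project, dataset, table, bucket and connection names are placeholders for my actual ones):

```python
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

# Export the final BigQuery table to GCS as CSV.
# All names here are placeholders, not my real resources.
bq_to_gcs = BigQueryToCloudStorageOperator(
    task_id='bq_table_to_gcs',
    source_project_dataset_table='my-project.my_dataset.final_table',
    destination_cloud_storage_uris=['gs://my-gcs-bucket/final_table/export-*.csv'],
    export_format='CSV',
    bigquery_conn_id='bigquery_default',
    dag=dag,
)
```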
However, the transfer from Google Cloud Storage to S3 seems to be a less trodden route, and I have been unable to find a solution that I can automate within my Airflow workflow.
I am aware of rsync, which comes as part of gsutil, and have gotten it working (see the post Exporting data from Google Cloud Storage to Amazon S3), but I have been unable to add it to my workflow.
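What I imagine is wrapping that gsutil command in a BashOperator, along the lines of the sketch below. This assumes gsutil is installed inside the Airflow container and can see my AWS credentials (e.g. via a ~/.boto file); the bucket names are placeholders:

```python
from airflow.operators.bash_operator import BashOperator

# Sync the exported files from GCS to S3 with gsutil rsync.
# Assumes gsutil is available in the container and AWS credentials
# are configured for it (e.g. in ~/.boto). Bucket names are placeholders.
gcs_to_s3 = BashOperator(
    task_id='gcs_to_s3_rsync',
    bash_command='gsutil rsync -r gs://my-gcs-bucket/final_table s3://my-s3-bucket/final_table',
    dag=dag,
)
```

This is where I am stuck, as I am not sure how to get gsutil and the credentials set up inside the container so that a task like this can run.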
I have a Dockerised Airflow container running on a Compute Engine instance.
I would really appreciate help solving this problem.
Many thanks!