1 vote

Currently there is no S3ToBigQuery operator.

My choices are:

  1. Use the S3ToGoogleCloudStorageOperator and then use the GoogleCloudStorageToBigQueryOperator

    This is not something I'm eager to do, since it means paying for storage twice. Even if I remove the file from one of the storage systems, that still involves payment.

  2. Download the file from S3 to the local file system and load it into BigQuery from there - However, there is no S3DownloadOperator. This means writing the whole process from scratch without Airflow's involvement, which misses the point of using Airflow.

Is there another option? What would you suggest?


3 Answers

1 vote

This is what I ended up with. It should be converted into an S3toLocalFile operator.

import logging

from airflow.hooks.S3_hook import S3Hook
from airflow.models import Variable


def download_from_s3(**kwargs):
    hook = S3Hook(aws_conn_id='project-s3')

    # read_key returns the object's contents as a string
    result = hook.read_key(bucket_name='stage-project-metrics',
                           key='{}.csv'.format(kwargs['ds']))

    if not result:
        logging.info('no data found')
    else:
        outfile = '{}project{}.csv'.format(Variable.get("data_directory"),
                                           kwargs['ds'])
        with open(outfile, 'w') as f:
            f.write(result)

    return result
0 votes

If the first option is too costly, you can use the S3Hook to download the file from within a PythonOperator:

from datetime import datetime

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0
}

def download_from_s3(**kwargs):
    hook = S3Hook(aws_conn_id='s3_conn')

    # read_key returns the object's contents as a string
    return hook.read_key(bucket_name='workflows-dev',
                         key='test_data.csv')

dag = DAG('s3_download',
          schedule_interval='@daily',
          default_args=default_args,
          catchup=False)

with dag:
    download_data = PythonOperator(
        task_id='download_data',
        python_callable=download_from_s3,
        provide_context=True
    )
0 votes

What you can do instead is use the S3ToGoogleCloudStorageOperator followed by the GoogleCloudStorageToBigQueryOperator with the external_table flag, i.e. pass external_table=True.

This creates an external table that points to the GCS location. Your data isn't stored in BigQuery, but you can still query it.
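A minimal sketch of that approach, assuming an Airflow 1.x install with the contrib operators available; the connection ids, bucket names, prefixes, and the dataset/table name are all placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.s3_to_gcs_operator import S3ToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

dag = DAG('s3_to_bq_external',
          schedule_interval='@daily',
          start_date=datetime(2018, 1, 1),
          catchup=False)

with dag:
    # Copy the S3 objects into a GCS bucket
    s3_to_gcs = S3ToGoogleCloudStorageOperator(
        task_id='s3_to_gcs',
        bucket='my-s3-bucket',                 # placeholder S3 bucket
        prefix='metrics/',                     # placeholder key prefix
        dest_gcs_conn_id='google_cloud_default',
        dest_gcs='gs://my-gcs-bucket/metrics/')

    # Register an external BigQuery table over the GCS files;
    # BigQuery reads the data in place instead of loading it
    gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id='gcs_to_bq',
        bucket='my-gcs-bucket',
        source_objects=['metrics/*.csv'],
        destination_project_dataset_table='my_project.my_dataset.metrics',
        source_format='CSV',
        external_table=True)

    s3_to_gcs >> gcs_to_bq
```

Since the external table only references the files, the GCS copy must remain in place for queries to keep working.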