3 votes

I use Google Cloud Dataflow with the Python SDK on Google Cloud Platform. My idea is to read input data from AWS S3.

Google Cloud Dataflow (which is based on Apache Beam) supports reading files from S3. However, I cannot find in the documentation a recommended way to pass AWS credentials to a job. I tried adding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables in the setup.py file. That works locally, but when I package the Cloud Dataflow job as a template and trigger it to run on GCP, it sometimes works and sometimes raises a "NoCredentialsError" exception, causing the job to fail.
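
Roughly, the setup.py approach I tried looks like the sketch below (the credential values and package name are just placeholders):

    # setup.py - simplified sketch of the approach described above;
    # credential values and package name are placeholders.
    import os
    import setuptools

    # Export the AWS keys as environment variables so boto3 can pick them up.
    # This is the part that behaves inconsistently on the Dataflow workers.
    os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
    os.environ["AWS_SECRET_ACCESS_KEY"] = "..."

    setuptools.setup(
        name="my-dataflow-job",
        version="0.0.1",
        packages=setuptools.find_packages(),
        install_requires=["apache-beam[gcp,aws]"],
    )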

Is there a coherent, best-practice way to pass AWS credentials to a Python Google Cloud Dataflow job on GCP?

1
How do you trigger your job? If it is through the CLI, have you set the awsAccessKey and awsSecretKey flags? - Alexandre Moraes
I am using the Google Cloud SDK to save the template and then the templates.launch method from the Cloud Dataflow API (cloud.google.com/dataflow/docs/reference/rest). The documentation does not mention any way to set awsAccessKey or awsSecretKey. - Stanisław Smyl
When I mentioned setting the two flags, I was referring to this documentation. Would it work for you? - Alexandre Moraes
Not really; I am not transferring data from S3 to Google Cloud Storage. I am using an S3 object as the input to a Cloud Dataflow job. - Stanisław Smyl

1 Answer

2 votes

The options to configure this have finally been added. They are available in Beam 2.26.0 and later.

The pipeline options are --s3_access_key_id and --s3_secret_access_key.
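
If you construct the pipeline options in code rather than on the command line, it could look roughly like this (project, region, bucket and credential values below are placeholders):

    # Sketch: passing S3 credentials as pipeline options (Beam 2.26.0+).
    # Project, region, bucket and credential values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="europe-west1",
        temp_location="gs://my-bucket/tmp",
        s3_access_key_id="AKIA...",
        s3_secret_access_key="...",
    )

    with beam.Pipeline(options=options) as p:
        # Reading directly from S3 requires apache-beam[aws] on the workers.
        (
            p
            | "ReadFromS3" >> beam.io.ReadFromText("s3://my-bucket/input/*.json")
            | "Print" >> beam.Map(print)
        )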


Unfortunately, the Beam 2.25.0 release and earlier don't have a good way of doing this, other than the following:

In this thread, a user figured out how to do it in the setup.py file that they provide to Dataflow with their pipeline.
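
One possible shape for that kind of setup.py workaround (a rough sketch, not the exact code from the thread) is a custom setuptools command that writes an AWS credentials file while the package is being installed on each worker:

    # setup.py - rough sketch of a custom-command workaround; the credential
    # values and package name are placeholders.
    import os

    import setuptools
    from distutils.command.build import build as _build


    class build(_build):
        # Run the custom command as part of the normal build on each worker.
        sub_commands = _build.sub_commands + [("WriteAwsCredentials", None)]


    class WriteAwsCredentials(setuptools.Command):
        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            # Write a minimal ~/.aws/credentials file so boto3 can find the
            # keys when the pipeline reads from S3 on this worker.
            aws_dir = os.path.expanduser("~/.aws")
            os.makedirs(aws_dir, exist_ok=True)
            with open(os.path.join(aws_dir, "credentials"), "w") as f:
                f.write(
                    "[default]\n"
                    "aws_access_key_id = AKIA...\n"
                    "aws_secret_access_key = ...\n"
                )


    setuptools.setup(
        name="my-dataflow-job",
        version="0.0.1",
        packages=setuptools.find_packages(),
        install_requires=["apache-beam[gcp,aws]"],
        cmdclass={"build": build, "WriteAwsCredentials": WriteAwsCredentials},
    )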