3 votes

I have a Cloud Dataflow job that does a bunch of processing for an App Engine app. At one stage in the pipeline, I group by a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).

I don't know in advance how many of these records there will be, so this usage pattern doesn't fit the standard Cloud Dataflow data sink pattern (where the sharding of that output stage determines the number of output files, and I have no control over the output file names per shard).

I am considering writing to Cloud Storage directly as a side-effect in a ParDo function, but have the following queries:

  1. Is writing to Cloud Storage as a side-effect allowed at all?
  2. If I were writing from outside a Dataflow pipeline, it seems I should use the Java client for the Cloud Storage JSON API. But that involves authenticating via OAuth to do any work, which seems inappropriate for a job already running on GCE machines as part of a Dataflow pipeline: will this work?
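
For concreteness, this is roughly the shape I have in mind (the String element type, the bucket name and writeObjectToGcs are placeholders; the write itself is exactly the part I'm unsure about):

    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.values.KV;

    // Applied after the GroupByKey, so each element is one key plus all of
    // the records grouped under it.
    class WritePerKeyFileFn extends DoFn<KV<String, Iterable<String>>, Void> {
      @Override
      public void processElement(ProcessContext c) throws Exception {
        String key = c.element().getKey();
        // The key becomes part of the object name.
        writeObjectToGcs("my-bucket", "output/" + key + ".txt", c.element().getValue());
      }

      // Placeholder: this is exactly the part queries 1 and 2 above are about.
      private void writeObjectToGcs(String bucket, String name, Iterable<String> records) {
        // TODO: authenticate and write via the Cloud Storage JSON API?
      }
    }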

Any advice gratefully received.


2 Answers

3 votes

Answering the first part of your question:

While nothing directly prevents you from performing side-effects (such as writing to Cloud Storage) in your pipeline code, it is usually a very bad idea. Keep in mind that your code is not running single-threaded on a single machine. You'd need to deal with several problems:

  1. Multiple writers could be writing at the same time. You need to find a way to avoid conflicts between them. Since Cloud Storage doesn't support appending to an object directly, you might want to use the composite objects technique.
  2. Workers can be aborted, e.g. in case of transient failures or problems with the infrastructure, which means you need to be able to handle interrupted or incomplete writes.
  3. Workers can be restarted after they were aborted. That would cause the side-effect code to run again, so you need to be able to handle duplicate entries in your output in one way or another; one common approach is sketched after this list.
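
For example, one way to deal with (2) and (3) is to make each write idempotent: derive the object name purely from the key, so that a retried bundle overwrites the earlier attempt instead of adding a duplicate, and an interrupted upload never leaves a partial object visible under that name (individual Cloud Storage object writes are atomic). A minimal sketch, with the path scheme purely as an example:

    // The object name depends only on the key (never on a worker id, a
    // timestamp, or a random suffix), so retries of the same bundle always
    // target the same object and simply overwrite the earlier attempt.
    static String objectNameForKey(String key) {
      return "output/" + key + ".txt";
    }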
0 votes

Nothing in Dataflow prevents you from writing to a GCS file in your ParDo.

You can use GcpOptions.getCredential() to obtain a credential to use for authentication. This will use a suitable mechanism for obtaining a credential depending on how the job is running. For example, it will use a service account when the job is executing on the Dataflow service.
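
A sketch of how that can fit together inside a DoFn, using the Cloud Storage JSON API client. The bucket, object names and application name are placeholders, and I'm assuming the credential accessor is GcpOptions.getGcpCredential(); adjust to whatever your SDK version exposes:

    import com.google.api.client.auth.oauth2.Credential;
    import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
    import com.google.api.client.http.ByteArrayContent;
    import com.google.api.client.json.jackson2.JacksonFactory;
    import com.google.api.services.storage.Storage;
    import com.google.api.services.storage.model.StorageObject;
    import com.google.cloud.dataflow.sdk.options.GcpOptions;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.values.KV;

    class WriteToGcsFn extends DoFn<KV<String, Iterable<String>>, Void> {
      private final String bucket;
      private transient Storage storage;  // not serializable, so rebuilt lazily per worker

      WriteToGcsFn(String bucket) {
        this.bucket = bucket;
      }

      private Storage storage(ProcessContext c) throws Exception {
        if (storage == null) {
          // The credential the SDK is already using for this job: a service
          // account on the Dataflow service, your user credential locally.
          Credential credential =
              c.getPipelineOptions().as(GcpOptions.class).getGcpCredential();
          storage = new Storage.Builder(
                  GoogleNetHttpTransport.newTrustedTransport(),
                  JacksonFactory.getDefaultInstance(),
                  credential)
              .setApplicationName("my-dataflow-job")  // placeholder
              .build();
        }
        return storage;
      }

      @Override
      public void processElement(ProcessContext c) throws Exception {
        String key = c.element().getKey();
        StringBuilder contents = new StringBuilder();
        for (String record : c.element().getValue()) {
          contents.append(record).append('\n');
        }

        // One object per key; a deterministic name keeps retries idempotent.
        StorageObject metadata =
            new StorageObject().setName("output/" + key + ".txt");
        storage(c).objects()
            .insert(bucket, metadata,
                ByteArrayContent.fromString("text/plain", contents.toString()))
            .execute();
      }
    }

The same credential is used regardless of where the job runs, so no separate OAuth flow should be needed in the pipeline code itself.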