I have a Cloud Dataflow job that does a bunch of processing for an App Engine app. At one stage in the pipeline, I group the data by a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).
I don't know in advance how many of these records there will be, so this usage pattern doesn't fit the standard Cloud Dataflow data sink pattern (where the sharding of the output stage determines the number of output files, and I have no control over the per-shard file names).
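For concreteness, here's roughly the shape of that stage. The `MyRecord` type, the bucket name, and the `writeToGcs` helper are placeholders, and `records` is the keyed PCollection feeding the group-by:

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Group the records by key; each grouped element carries one key
// and all of the records that matched it.
PCollection<KV<String, Iterable<MyRecord>>> grouped =
    records.apply(GroupByKey.<String, MyRecord>create());

// For each record under a key, write one file whose name includes
// that key. This is the side-effect write I'm asking about below.
grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<MyRecord>>, Void>() {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String key = c.element().getKey();
    int i = 0;
    for (MyRecord record : c.element().getValue()) {
      // e.g. gs://my-bucket/output/<key>-0, <key>-1, ...
      writeToGcs("my-bucket", "output/" + key + "-" + (i++), record);
    }
  }
}));
```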
I am considering writing to Cloud Storage directly as a side effect in a ParDo function, but I have the following questions:
- Is writing to Cloud Storage as a side effect allowed at all?
- If I were writing from outside a Dataflow pipeline, it seems I should use the Java client for the JSON Cloud Storage API. But that involves authenticating via OAuth to do any work, which seems inappropriate for a job already running on GCE machines as part of a Dataflow pipeline. Will this work?
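To make the second question concrete, this is roughly how I'd expect to set up the client from inside the DoFn. Whether `GoogleCredential.getApplicationDefault()` actually resolves to the worker VM's service account (via the metadata server) is exactly the part I'm unsure about; the bucket, object name, and `my-dataflow-job` application name are placeholders:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.ByteArrayContent;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.storage.Storage;
import com.google.api.services.storage.StorageScopes;
import com.google.api.services.storage.model.StorageObject;

// Build a Storage client from the application default credentials.
// On a GCE worker this should fall back to the VM's service account,
// which is the assumption I'd like to confirm.
static Storage buildStorageClient() throws Exception {
  GoogleCredential credential = GoogleCredential.getApplicationDefault();
  if (credential.createScopedRequired()) {
    credential = credential.createScoped(StorageScopes.all());
  }
  return new Storage.Builder(
          GoogleNetHttpTransport.newTrustedTransport(),
          JacksonFactory.getDefaultInstance(),
          credential)
      .setApplicationName("my-dataflow-job")
      .build();
}

// Upload one object via the JSON API; called per record from the DoFn.
static void writeToGcs(Storage storage, String bucket, String objectName,
    byte[] contents) throws Exception {
  storage.objects()
      .insert(bucket,
          new StorageObject().setName(objectName),
          new ByteArrayContent("application/octet-stream", contents))
      .execute();
}
```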
Any advice gratefully received.