
My current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(), which returns a PCollection<MatchResult.Metadata>. I want to write these files back, with the same names, to another GCS bucket, i.e. each element (one file's metadata / ReadableFile) should be written back to another bucket after some processing. Is there a sink I should use to write each element back to GCS, or is there some other way to do it? Is it possible to create a window per element and then use some GCS sink IO to achieve this? Also, when operating on a window (even one with multiple elements), does Beam guarantee that the window is either fully processed or not processed at all? In other words, are write operations to GCS or BigQuery for a given window atomic, with no partial writes in case of failure?
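For concreteness, here is a minimal sketch of the matching stage described above; the file pattern, poll interval, and termination condition are illustrative placeholders, not values from the original pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchSourceBucket {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Continuously match new files under the source bucket (pattern is a placeholder).
    PCollection<MatchResult.Metadata> matches =
        p.apply(Create.of("gs://source-bucket/events/*"))
         .apply(FileIO.matchAll()
             .continuously(Duration.standardMinutes(1), Watch.Growth.never()));

    // Turn each match into a ReadableFile for per-file processing downstream.
    PCollection<FileIO.ReadableFile> files = matches.apply(FileIO.readMatches());

    p.run();
  }
}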

Correct me if I'm wrong, but it sounds like you just need to copy files from one destination to another. Do you actually need Dataflow for this? I imagine you could use some combination of gsutil, bq, and Google Cloud Functions to achieve all of this. – Andrew Nguonly
Yes, almost. I need to go through all the events stored in those files and filter out some, mutate some, and aggregate some, but at the file level, because that gives me idempotency: if I reprocess the same file, I get the same data. That said, if there are ways to achieve this by flattening the files into streaming windowed events and then using the normal TextIO/AvroIO, I would like to know those solutions as well. – user179156
Thanks for the clarification. I can't think of any way to achieve this. The requirement of file-level idempotency may be too complex to implement in Dataflow. A naive solution would be to process one file per Dataflow job, which is not unreasonable in my opinion. – Andrew Nguonly
Oh, that is definitely unreasonable. Millions of files are generated every day; we cannot run a million Dataflow jobs just to process a few megabytes per file, where each file holds only a few thousand events. Running a Dataflow job per file is not a solution at all, and not just Dataflow: spinning up any kind of binary per file is not a solution. It is like saying I will have one server per request. – user179156

1 Answer


Can you simply write a DoFn<ReadableFile, Void> that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write() etc.) are under the hood anyway: they are simply Beam transforms made of ParDo's and GroupByKey's.
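A minimal, untested sketch of such a DoFn follows; the destination bucket path is a placeholder assumption, and only the source confirms the general shape (DoFn<ReadableFile, Void> plus the FileSystems API):

import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

public class CopyToDestinationFn extends DoFn<FileIO.ReadableFile, Void> {

  // Hypothetical destination; replace with the real target bucket.
  private static final String DEST_DIR = "gs://destination-bucket/";

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    ResourceId source = c.element().getMetadata().resourceId();

    // Resolve the destination path, keeping the original file name.
    ResourceId dest =
        FileSystems.matchNewResource(DEST_DIR, true /* isDirectory */)
            .resolve(source.getFilename(), StandardResolveOptions.RESOLVE_FILE);

    // Straight copy via the FileSystems API. If the contents need filtering or
    // mutation first, read them with c.element().readFullyAsUTF8String() (or open()
    // for a channel) and write the result with FileSystems.create(dest, mimeType).
    FileSystems.copy(Collections.singletonList(source), Collections.singletonList(dest));
  }
}

Applied as files.apply(ParDo.of(new CopyToDestinationFn())), this keeps the per-file granularity discussed in the comments: each element is one whole file, and since the destination name is derived only from the source name, reprocessing the same file overwrites the same object rather than producing partial or duplicate outputs.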