
My current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(), which returns a PCollection<MatchResult.Metadata>. I want to write these files back, with the same names, to another GCS bucket, i.e. each element (one file's metadata / ReadableFile) should be written back to another bucket after some processing. Is there a sink I should use to write each element back to GCS, or is there some other way to do it? Is it possible to create a window per element and then use some GCS sink IO to achieve this? Also, when operating on a window (even one with multiple elements), does Beam guarantee that the window is either fully processed or not processed at all? In other words, are write operations to GCS or BigQuery for a given window atomic, with no partial writes in case of failure?
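For concreteness, here is a minimal sketch of the matching stage described above; the file pattern, poll interval, and termination condition are illustrative placeholders, not values from the original pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WatchSourceBucket {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Continuously match new files under the source bucket (pattern is a placeholder).
    PCollection<MatchResult.Metadata> matches =
        p.apply(Create.of("gs://source-bucket/events/*"))
         .apply(FileIO.matchAll()
             .continuously(Duration.standardMinutes(1), Watch.Growth.never()));

    // Turn each match into a ReadableFile for per-file processing downstream.
    PCollection<FileIO.ReadableFile> files = matches.apply(FileIO.readMatches());

    p.run();
  }
}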

Correct me if I'm wrong, but it sounds like you just need to copy files from one destination to another. Do you actually need Dataflow for this? I imagine you could use some combination of gsutil, bq, and Google Cloud Functions to achieve all of this. – Andrew Nguonly
Yes, almost. I need to go through all the events stored in those files and filter out some, mutate some, and aggregate some, but at the file level, because that gives me idempotency: if I reprocess the same file, I get the same data. That said, if there are ways to achieve this by flattening the files into streaming windowed events and then using the normal TextIO/AvroIO, I would like to know those solutions as well. – user179156
Thanks for the clarification. I can't think of any way to achieve this. The requirement of file-level idempotency may be too complex to implement in Dataflow. A naive solution would be to process one file per Dataflow job, which is not unreasonable in my opinion. – Andrew Nguonly
Oh, that is definitely unreasonable. Millions of files are generated every day; we cannot run a million Dataflow jobs just to process a few megabytes per file, where each file holds only a few thousand events. Running a Dataflow job per file is not a solution at all, and not just Dataflow: spinning up any kind of binary per file is not a solution. It is like saying I will have one server per request. – user179156

1 Answer


Can you simply write a DoFn<ReadableFile, Void> that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write() etc.) are under the hood anyway: they are simply Beam transforms made of ParDo's and GroupByKey's.
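A minimal, untested sketch of such a DoFn follows; the destination bucket path is a placeholder assumption, and only the source confirms the general shape (DoFn<ReadableFile, Void> plus the FileSystems API):

import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

public class CopyToDestinationFn extends DoFn<FileIO.ReadableFile, Void> {

  // Hypothetical destination; replace with the real target bucket.
  private static final String DEST_DIR = "gs://destination-bucket/";

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    ResourceId source = c.element().getMetadata().resourceId();

    // Resolve the destination path, keeping the original file name.
    ResourceId dest =
        FileSystems.matchNewResource(DEST_DIR, true /* isDirectory */)
            .resolve(source.getFilename(), StandardResolveOptions.RESOLVE_FILE);

    // Straight copy via the FileSystems API. If the contents need filtering or
    // mutation first, read them with c.element().readFullyAsUTF8String() (or open()
    // for a channel) and write the result with FileSystems.create(dest, mimeType).
    FileSystems.copy(Collections.singletonList(source), Collections.singletonList(dest));
  }
}

Applied as files.apply(ParDo.of(new CopyToDestinationFn())), this keeps the per-file granularity discussed in the comments: each element is one whole file, and since the destination name is derived only from the source name, reprocessing the same file overwrites the same object rather than producing partial or duplicate outputs.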