2 votes

I'm using Apache Beam (with the Dataflow runner) to download binary files (weather forecasts, roughly 300 files), decode them, store the results as CSV, and then load the CSV into BigQuery.

------------      ----------      --------------      -------------------
| Download | ---> | Decode | ---> | CSV in GCS | ---> | CSV to BigQuery |
------------      ----------      --------------      -------------------
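For reference, a minimal sketch of what such a pipeline could look like in the Beam Python SDK. The helpers download_file and decode_to_csv_line, the URL list, and the output prefix are hypothetical placeholders for the actual download/decode logic:

    import urllib.request

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def download_file(url):
        # Hypothetical helper: fetch one binary forecast file, keep its URL.
        with urllib.request.urlopen(url) as resp:
            return url, resp.read()

    def decode_to_csv_line(element):
        # Hypothetical stand-in for the real decoder: one CSV line per file.
        url, payload = element
        return '%s,%d' % (url, len(payload))

    def run(urls, output_prefix):
        # PipelineOptions picks up flags such as --runner=DataflowRunner.
        with beam.Pipeline(options=PipelineOptions()) as p:
            (p
             | 'CreateUrls' >> beam.Create(urls)
             | 'Download' >> beam.Map(download_file)
             | 'Decode' >> beam.Map(decode_to_csv_line)
             | 'WriteCsv' >> beam.io.WriteToText(output_prefix,
                                                 file_name_suffix='.csv'))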

One of the reasons for moving to Dataflow was the ease of parallelisation across workers. The whole flow takes too long to execute, and looking at the logs, it seems I'm missing something in my overall understanding of Dataflow.

I thought that once an upstream transform finished processing an element, that element was immediately available for downstream processing (at least with the default configuration). However, my elements are not made available downstream right after they have been processed.

How can I tell the pipeline to process elements as soon as they are available, without any sort of batching?

1
Can you provide some code showing which operations you are actually doing? Is any particular step the bottleneck? Are all your binary files roughly the same size, etc.? – Guillem Xercavins

1 Answer

1 vote

How can I tell the pipeline to process elements as soon as they are available, without any sort of batching?

I don't think there is any way to do this for a pipeline with a bounded data source. A Dataflow job running in batch mode won't allow elements to proceed to the next transform until all elements have been processed; there is an intermediate step that persists the processed elements. I believe this is part of how Dataflow guarantees that every element gets processed.

process elements as soon as they are available

That is exactly what a Dataflow job running in streaming mode (i.e., one with an unbounded data source) does.
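As an illustration only, here is a hedged sketch of the same pipeline fed from Cloud Pub/Sub, which makes the source unbounded so Dataflow runs the job in streaming mode. The topic, the one-URL-per-message layout, the output table schema, and the reuse of the hypothetical download_file/decode_to_csv_line helpers from the sketch in the question are all assumptions:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import (PipelineOptions,
                                                      StandardOptions)

    def run(topic, output_table):
        options = PipelineOptions()
        options.view_as(StandardOptions).streaming = True  # streaming job
        with beam.Pipeline(options=options) as p:
            (p
             # Unbounded source: each Pub/Sub message is assumed to carry
             # one file URL as its payload.
             | 'ReadUrls' >> beam.io.ReadFromPubSub(topic=topic)
             | 'BytesToUrl' >> beam.Map(lambda data: data.decode('utf-8'))
             | 'Download' >> beam.Map(download_file)      # see sketch above
             | 'Decode' >> beam.Map(decode_to_csv_line)   # see sketch above
             | 'ToRow' >> beam.Map(lambda line: {'line': line})
             | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                 output_table,
                 schema='line:STRING',
                 write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

With an unbounded source like this, elements flow to downstream transforms as they arrive instead of waiting for the whole bounded input to finish.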