I'm using Apache Beam (with the Dataflow runner) to download binary files (weather forecasts, roughly 300 files), decode them, store the results as CSV, and then load the CSV into BigQuery.
------------      ----------      --------------      -------------------
| Download | ---> | Decode | ---> | CSV in GCS | ---> | CSV to BigQuery |
------------      ----------      --------------      -------------------
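For reference, here is a minimal sketch of the kind of pipeline I have in mind (the URLs, bucket path, and decoding logic are placeholders, not my actual code):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder list of forecast file URLs (the real pipeline has ~300 of them).
FILE_URLS = ["https://example.com/forecast/file_%03d.bin" % i for i in range(300)]

def download_file(url):
    """Fetch one binary forecast file and return (url, raw_bytes)."""
    import urllib.request
    with urllib.request.urlopen(url) as resp:
        return url, resp.read()

def decode_to_csv_rows(element):
    """Decode one binary payload into CSV lines (stub decoding logic)."""
    url, raw_bytes = element
    # The real decoding of the weather format would go here.
    yield "%s,%d" % (url, len(raw_bytes))

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "FileList" >> beam.Create(FILE_URLS)
     | "Download" >> beam.Map(download_file)
     | "Decode"   >> beam.FlatMap(decode_to_csv_rows)
     | "WriteCSV" >> beam.io.WriteToText("gs://my-bucket/forecast",
                                         file_name_suffix=".csv"))
    # A later step then loads the resulting CSV files into BigQuery.
```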
One of the reasons for moving to Dataflow was the ease of parallelising work across workers. However, the whole flow takes too long to execute, and looking at the logs, it seems I'm missing something in my overall understanding of how Dataflow works.
I thought that once an element was done upstream, it would immediately be available for downstream processing (at least with the default configuration). However, my elements are not made available downstream directly after they are processed.
How can I tell the pipeline to process elements as soon as they are available, without any kind of batching?