We've been using Dataflow in batch mode for a while now. However, we can't seem to find much info on its streaming mode.
We have the following use case:
- Data/events are streamed into BigQuery in real time
- We need to transform/clean/denormalize the data before analysis by the business
Now, we could of course use Dataflow in batch mode, take chunks of the data from BigQuery (based on timestamps), and transform/clean/denormalize it that way.
But that's a messy approach, especially since the data is streaming in real time: working out which rows still need processing would get complicated quickly, and the whole thing sounds brittle.
It would be great if we could simply transform/clean/denormalize in Dataflow, and then write to BigQuery as it's streaming in.
Is this what Dataflow streaming is intended for? If so, what data source can Dataflow read from in streaming mode?
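To make the shape of what we're after concrete, here is a rough sketch using the Apache Beam Python SDK, assuming Pub/Sub as the streaming source (the topic name, table name, and `clean_event` logic are all made up for illustration):

```python
import json


def clean_event(row):
    """Hypothetical cleanup step: trim string fields and drop nulls."""
    return {k: (v.strip() if isinstance(v, str) else v)
            for k, v in row.items() if v is not None}


def run():
    # Requires apache-beam[gcp]; imported lazily so clean_event is usable alone.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)
    with beam.Pipeline(options=opts) as p:
        (p
         | "Read" >> beam.io.ReadFromPubSub(
             topic="projects/our-project/topics/events")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "Clean" >> beam.Map(clean_event)
         | "Write" >> beam.io.WriteToBigQuery(
             "our-project:analytics.events_clean"))


if __name__ == "__main__":
    run()
```

That is, events would be cleaned *before* they land in BigQuery, instead of patching them up afterwards in batches.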