Our current data pipeline streams our events "directly" into BigQuery.
We have a stream of messages in Pub/Sub: a first Dataflow job reads them, enriches them, and writes them to another Pub/Sub topic; a second Dataflow job then reads that topic and writes into BigQuery.
It works fine, but it lacks proper error handling - we just drop invalid messages instead of handling them, or at least saving them for later.
We are thinking of enhancing the process: keep invalid messages aside, and allow a simple fix for them later on.
My first approach was to write those problematic messages to a different Pub/Sub topic and handle them from there, but a few people suggested saving them to GCS (maybe as Avro files) instead.
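For reference, the "keep invalid messages aside" part is usually done in Beam with a multi-output `ParDo`. A minimal sketch, assuming a hypothetical `Event` class and `enrich()` helper (neither is from the real pipeline); the invalid output could then go to either a Pub/Sub topic or GCS:

```java
// Dead-letter pattern sketch: route enrichment failures to a side output
// instead of dropping them. Event and enrich() are hypothetical placeholders.
final TupleTag<Event> validTag = new TupleTag<Event>() {};
final TupleTag<String> invalidTag = new TupleTag<String>() {};

PCollectionTuple results = messages.apply("Enrich",
    ParDo.of(new DoFn<String, Event>() {
      @ProcessElement
      public void process(ProcessContext c) {
        try {
          c.output(enrich(c.element()));      // happy path -> main output
        } catch (Exception e) {
          c.output(invalidTag, c.element());  // failure -> dead-letter output
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

// results.get(validTag)   -> continue towards BigQuery
// results.get(invalidTag) -> write aside (Pub/Sub or GCS) for later repair
```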
The question is: if we use GCS and Avro for the invalid messages, why not do it for all messages? Instead of enriching and writing to Pub/Sub, why not enrich and write to GCS?
If we did that, we could use AvroIO with watchForNewFiles(), and it seems straightforward.
But this sounds too simple, and too good. Before jumping into coding, I am concerned about a few things:
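What I have in mind is something like the following sketch, assuming a hypothetical `gs://my-bucket/events/` prefix and an Avro-generated `Event` class (the poll interval is an arbitrary example value):

```java
// Continuously read Avro files as they appear in the bucket.
// Watch.Growth.never() keeps the watch running for an unbounded pipeline.
PCollection<Event> events = pipeline.apply(
    AvroIO.read(Event.class)
        .from("gs://my-bucket/events/*.avro")
        .watchForNewFiles(
            Duration.standardMinutes(1),  // how often to poll for new files
            Watch.Growth.never()));       // never terminate the watch
```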
- I know that using windows in Dataflow turns the stream into batched data, but windowing is much more flexible than checking for new files every X minutes. How would I, for example, handle late data?
- The job runs endlessly, so the Avro files will pile up in one bucket. Is watchForNewFiles() supposed to work flawlessly as-is? Is it based on file timestamps? A naming format? Does it keep a "list" of known old files? Reading the FileIO code, the matching seems quite naive, which means that the bigger the bucket grows, the longer each match will take.
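On the late-data question above, the usual Beam answer is windows with allowed lateness and triggers. A sketch with assumed example values, and a hypothetical `getEventTimeMillis()` field - the caveat being that elements read from watched files won't carry your event time unless you re-assign timestamps yourself:

```java
// Re-assign event-time timestamps, then window with tolerance for late data.
// All durations and the timestamp field are illustrative assumptions.
PCollection<Event> windowed = events
    .apply(WithTimestamps.of(e -> new Instant(e.getEventTimeMillis())))
    .apply(Window.<Event>into(FixedWindows.of(Duration.standardMinutes(5)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes());
```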
Am I missing anything? Isn't this solution a worse fit for endless streaming than Pub/Sub?