Our current data pipeline streams our events "directly" into BigQuery.
We have a stream of messages in Pub/Sub: a first Dataflow job reads them, enriches them, and writes them to another Pub/Sub topic; a second Dataflow job then reads that topic and writes into BigQuery.
It works fine, but it lacks proper error handling - we just drop invalid messages instead of handling them, or at least saving them for later.
We are thinking of enhancing the process: keep invalid messages aside, and allow a simple fix for them later on.
My first approach was to write those problematic messages to a different Pub/Sub topic and handle them from there, but a few people suggested saving them to GCS (maybe as Avro files) instead.
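For reference, the "keep invalid messages aside" part is usually done in Beam with a multi-output `ParDo`. A minimal sketch, assuming a hypothetical `Event` class and `enrich()` helper (neither is from the real pipeline); the invalid output could then go to either a Pub/Sub topic or GCS:

```java
// Dead-letter pattern sketch: route enrichment failures to a side output
// instead of dropping them. Event and enrich() are hypothetical placeholders.
final TupleTag<Event> validTag = new TupleTag<Event>() {};
final TupleTag<String> invalidTag = new TupleTag<String>() {};

PCollectionTuple results = messages.apply("Enrich",
    ParDo.of(new DoFn<String, Event>() {
      @ProcessElement
      public void process(ProcessContext c) {
        try {
          c.output(enrich(c.element()));      // happy path -> main output
        } catch (Exception e) {
          c.output(invalidTag, c.element());  // failure -> dead-letter output
        }
      }
    }).withOutputTags(validTag, TupleTagList.of(invalidTag)));

// results.get(validTag)   -> continue towards BigQuery
// results.get(invalidTag) -> write aside (Pub/Sub or GCS) for later repair
```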
The question is: if we use GCS and Avro for the invalid messages, why not do it for all messages? Instead of enriching and writing to Pub/Sub, why not enrich and write to GCS?
If we did that, we could use AvroIO with watchForNewFiles(), and it seems straightforward.
But this sounds too simple, and too good. Before jumping into coding, I am concerned about a few things:
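What I have in mind is something like the following sketch, assuming a hypothetical `gs://my-bucket/events/` prefix and an Avro-generated `Event` class (the poll interval is an arbitrary example value):

```java
// Continuously read Avro files as they appear in the bucket.
// Watch.Growth.never() keeps the watch running for an unbounded pipeline.
PCollection<Event> events = pipeline.apply(
    AvroIO.read(Event.class)
        .from("gs://my-bucket/events/*.avro")
        .watchForNewFiles(
            Duration.standardMinutes(1),  // how often to poll for new files
            Watch.Growth.never()));       // never terminate the watch
```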
- I know that using windows in Dataflow turns the stream into batched data, but windowing is much more flexible than checking for new files every X minutes. How would I, for example, handle late data?
- The job runs endlessly, so the Avro files will pile up in one bucket. Is watchForNewFiles() supposed to work flawlessly as-is? Is it based on file timestamps? A naming format? Does it keep a "list" of known old files? Reading the FileIO code, the matching seems quite naive, which means that the bigger the bucket grows, the longer each match will take.
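On the late-data question above, the usual Beam answer is windows with allowed lateness and triggers. A sketch with assumed example values, and a hypothetical `getEventTimeMillis()` field - the caveat being that elements read from watched files won't carry your event time unless you re-assign timestamps yourself:

```java
// Re-assign event-time timestamps, then window with tolerance for late data.
// All durations and the timestamp field are illustrative assumptions.
PCollection<Event> windowed = events
    .apply(WithTimestamps.of(e -> new Instant(e.getEventTimeMillis())))
    .apply(Window.<Event>into(FixedWindows.of(Duration.standardMinutes(5)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes());
```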
Am I missing anything? Isn't this solution a worse fit for endless streaming than Pub/Sub?