0 votes

I'm writing a simple streaming pipeline (Apache Beam 2.11 SDK, Python 2.7.10) and running it on the Dataflow runner: read from Pub/Sub >> apply an element-wise beam.Map() transform >> sink to BigQuery (the code is at https://github.com/vibhorj/gcp/blob/master/df/streaming.py).
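
For reference, here is a minimal sketch of that pipeline shape (not the exact linked code; the topic name and pipeline options are placeholders, and a real Dataflow run also needs --project, --runner, and staging/temp locations):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Placeholder options; set streaming mode explicitly for a Pub/Sub source.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(topic='projects/<project>/topics/<topic>')
     | 'Transform' >> beam.Map(lambda msg: json.loads(msg))  # element-wise Map step
     | 'WriteBQ' >> beam.io.WriteToBigQuery(table='sw.payload'))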

As you can see in the screenshot below, it's just stuck at step 2, the Map() transform. The input collection has received 265 elements, but the output collection is empty, even though the data watermark for this step is progressing in almost real time!

Nothing gets streamed to BQ either (I confirmed it by running a query: SELECT * FROM sw.payload). Can anyone explain what's wrong in my code that's preventing data from flowing through the pipeline steps? I expected things to stream to the BQ sink in almost real time as messages are published to Pub/Sub.

I'm not using any grouping / aggregation transforms, so I don't see any reason windowing / triggers could be causing issues here (correct me if I'm mistaken!).

Thanks in advance for any clue to fix this!

UPDATE: I wrote another pipeline from scratch and it seems to work fine; data showed up in BQ within <10 sec! For this pipeline, the data seems to be stuck in the BQ streaming buffer (see screenshot, taken @22:15:00). I found another related SO thread, Streaming buffer - Google BigQuery, but that didn't solve my issue either!

Can you share the code in your pipeline and your Map transform? That way we can try to figure out what may be going on. – Pablo
How long does your pipeline run without writing to BQ? The BQ sink buffers some data before sending it to BQ. It could be that it's still being buffered. You could specify a lower batch_size to WriteToBigQuery to force it to write more often: github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/… – Pablo
If that's not helping, then you may need to file a support ticket with GCP so that someone can take a look at your pipeline. – Pablo
Thanks Pablo. I waited 1-2 hours before running a query against the BQ data, and there still was no data. Only after a 4-5 hour wait did the data appear in the BQ table. Is such a long delay expected? Unfortunately I'm not a paying GCP customer (just learning on a Free Trial account), so I'm not sure whether I can file a support ticket... From your response it seems there's nothing fundamentally wrong in my code, though; it's just the way the service works, or some bug perhaps. – Vibhor Jain

3 Answers

1 vote

Apache Beam's transforms for reading from and writing to data sources include a number of optimizations and tricks.

The Apache Beam transform that performs streaming inserts into BigQuery is no exception: it batches rows before writing them to BigQuery. This may add a few seconds of delay before the data is available to query.
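
If you want the sink to flush more often, as Pablo suggests in the comments, the Python SDK's WriteToBigQuery accepts a batch_size parameter. A minimal sketch (the table name is taken from the question; tune the value to your throughput):

import apache_beam as beam

# Smaller batches reach BigQuery sooner, at the cost of more
# streaming-insert requests (the SDK's default batch size is larger).
sink = beam.io.WriteToBigQuery(
    table='sw.payload',
    batch_size=10,  # rows per streaming-insert request
)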

It is also the case that BigQuery runs a number of background tasks for query optimization. Streaming inserts are added to a special buffer that is later loaded into the table, which can add extra delay before the data is available.
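
To check whether rows are still sitting in that buffer, you can inspect the table metadata. A sketch using the google-cloud-bigquery client library (assuming a reasonably recent client version and default credentials; this is a diagnostic aid, not part of the Beam pipeline):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('sw.payload')  # project inferred from credentials

buf = table.streaming_buffer  # None once the buffer has been flushed
if buf is not None:
    print('estimated rows in streaming buffer:', buf.estimated_rows)
    print('oldest entry time:', buf.oldest_entry_time)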

FWIW, 1-2 hours sounds like far too long a delay.


Check out an interesting blog post on the Life of a Streaming Insert

1 vote

I'd like to add some data as context:

I'm currently streaming Pub/Sub->Dataflow->BigQuery, and the delays are minimal.

SELECT CURRENT_TIMESTAMP(), MAX(ts)
  , TIMESTAMP_MILLIS(CAST(JSON_EXTRACT(ARRAY_AGG(message ORDER BY ts DESC LIMIT 1)[OFFSET(0)], '$.mtime') AS INT64))
FROM `pubsub.meetup_201811` 

2019-03-08 23:29:59.050310 UTC
2019-03-08 23:29:57.620443 UTC
2019-03-08 23:29:57.504 UTC

Let's see:

  • 23:29:57.504 - the original message time, as set by the source.
  • 23:29:57.620443 - timestamp added by the script that reads from the source and pushes to Pub/Sub
  • 23:29:59.050310 - current time

This shows less than 2 seconds from my script to BigQuery.

Let me run that query again:

2019-03-08 23:36:48.264672 UTC
2019-03-08 23:36:47.020180 UTC
2019-03-08 23:36:46.912 UTC

Here we see less than 1.2 seconds between the script and the query.

And a third time:

2019-03-08 23:40:13.327028 UTC
2019-03-08 23:40:12.428090 UTC
2019-03-08 23:40:12.255 UTC

1.1 seconds.

Note my setup for this pipeline:

  • Plain Pub/Sub.
  • Dataflow to BigQuery provided by GCP's template (Java).
  • Somehow Dataflow reports higher latency for the pipeline than what we actually observe.


0 votes

It seems to work now as expected. It looks like there were some intermittent issues: the data was stuck in the BQ streaming buffer for hours and was also un-queryable with a normal SELECT statement.