I'm writing a simple streaming pipeline (Apache Beam 2.11 SDK, Python 2.7.10) and running it on the Dataflow runner: read from Pub/Sub >> apply an element-wise beam.Map() transform >> sink to BigQuery (the code is at https://github.com/vibhorj/gcp/blob/master/df/streaming.py).
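For reference, the pipeline is essentially the following (a minimal sketch; the Pub/Sub topic, project, and BQ table names here are placeholders, the real ones are in streaming.py):

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # streaming mode on Dataflow

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(
           topic='projects/<PROJECT>/topics/<TOPIC>')  # placeholder topic
     | 'Transform' >> beam.Map(json.loads)             # element-wise map
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           table='payload', dataset='sw', project='<PROJECT>'))
```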
As you can see in the screenshot below, it's just stuck at step 2, the Map() transform. The input collection shows 265 elements read, but the output collection is empty, even though the data watermark for this step is progressing in near real time!
Nothing gets streamed to BQ either (I confirmed this by running the query SELECT * FROM sw.payload). Can anyone explain what in my code is preventing data from flowing through the pipeline steps? I expected things to stream to the BQ sink in near real time as messages are published to Pub/Sub.
I'm not using any grouping or aggregation transforms, so I don't see how windowing or triggers could be causing issues here (correct me if I'm mistaken!).
Thanks in advance for any clue to fix this!
UPDATE: I wrote another pipeline from scratch and it seems to work fine; data showed up in BQ in under 10 seconds! For this pipeline, the data seems to be stuck in the BQ streaming buffer (see screenshot, taken @22:15:00). I found a related SO thread, Streaming buffer - Google BigQuery, but that didn't solve my issue either.
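In case anyone wants to verify the streaming-buffer state for themselves, the table's streamingBuffer statistics can be inspected with the google-cloud-bigquery client (a sketch; the project ID is a placeholder):

```python
from google.cloud import bigquery  # google-cloud-bigquery client library

client = bigquery.Client(project='<PROJECT>')  # placeholder project ID
table = client.get_table(client.dataset('sw').table('payload'))

# streaming_buffer is non-None while rows are still sitting in the buffer
print(table.streaming_buffer)
```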
Comment: Try passing batch_size to WriteToBigQuery to force it to write more often: github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/… – Pablo
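A sketch of what that suggestion would look like (table/dataset/project are placeholders; batch_size controls how many rows are buffered per streaming insert, so a small value forces more frequent flushes):

```python
import apache_beam as beam

# Sketch only: the same sink as before, but with batch_size lowered so rows
# are flushed to BigQuery more frequently instead of accumulating.
sink = beam.io.WriteToBigQuery(
    table='payload',       # placeholder table
    dataset='sw',          # placeholder dataset
    project='<PROJECT>',   # placeholder project
    batch_size=1)          # flush after every row
```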