0 votes

I'm writing a simple streaming pipeline (Apache Beam 2.11 SDK, Python 2.7.10) and running it on the Dataflow runner: read from Pub/Sub >> apply an element-wise beam.Map() transform >> sink to BigQuery (the code is at https://github.com/vibhorj/gcp/blob/master/df/streaming.py).
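
For reference, here is a minimal sketch of that pipeline shape (not the exact linked code; the topic name and pipeline options are placeholders, and a real Dataflow run also needs --project, --runner, and staging/temp locations):

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Placeholder options; set streaming mode explicitly for a Pub/Sub source.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(topic='projects/<project>/topics/<topic>')
     | 'Transform' >> beam.Map(lambda msg: json.loads(msg))  # element-wise Map step
     | 'WriteBQ' >> beam.io.WriteToBigQuery(table='sw.payload'))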

As you can see in the screenshot below, it's just stuck at step 2, the Map() transform. The input collection has received 265 elements, but the output collection is empty, even though the data watermark for this step is progressing in almost real time!

Nothing gets streamed to BQ either (I confirmed it by running a query: SELECT * FROM sw.payload). Can anyone explain what's wrong in my code that's preventing data from flowing through the pipeline steps? I expected things to stream to the BQ sink in almost real time as messages are published to Pub/Sub.

I'm not using any grouping / aggregation transforms, so I don't see any reason windowing / triggers could be causing issues here (correct me if I'm mistaken!).

Thanks in advance for any clue to fix this!

UPDATE: I wrote another pipeline from scratch and it seems to work fine; data showed up in BQ within <10 sec! For this pipeline, the data seems to be stuck in the BQ streaming buffer (see screenshot, taken @22:15:00). I found another related SO thread, Streaming buffer - Google BigQuery, but that didn't solve my issue either!

Can you share the code in your pipeline and your Map transform? That way we can try to figure out what may be going on. – Pablo
How long does your pipeline run without writing to BQ? The BQ sink buffers some data before sending it to BQ. It could be that it's still being buffered. You could specify a lower batch_size to WriteToBigQuery to force it to write more often: github.com/apache/beam/blob/v2.11.0/sdks/python/apache_beam/io/… – Pablo
If that's not helping, then you may need to file a support ticket with GCP so that someone can take a look at your pipeline. – Pablo
Thanks Pablo. I waited 1-2 hours before running a query against the BQ data, and there still was no data. Only after a 4-5 hour wait did the data appear in the BQ table. Is such a long delay expected? Unfortunately I'm not a paying GCP customer (just learning on a Free Trial account), so I'm not sure whether I can file a support ticket... From your response it seems there's nothing fundamentally wrong in my code, though; it's just the way the service works, or some bug perhaps. – Vibhor Jain

3 Answers

1 vote

Apache Beam's transforms for reading from and writing to data sources include a number of optimizations and tricks.

The Apache Beam transform that performs streaming inserts into BigQuery is no exception: it batches rows before writing them to BigQuery. This may add a few seconds of delay before the data is available to query.
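
If you want the sink to flush more often, as Pablo suggests in the comments, the Python SDK's WriteToBigQuery accepts a batch_size parameter. A minimal sketch (the table name is taken from the question; tune the value to your throughput):

import apache_beam as beam

# Smaller batches reach BigQuery sooner, at the cost of more
# streaming-insert requests (the SDK's default batch size is larger).
sink = beam.io.WriteToBigQuery(
    table='sw.payload',
    batch_size=10,  # rows per streaming-insert request
)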

It is also the case that BigQuery runs a number of background tasks for query optimization. Streaming inserts are added to a special buffer that is later loaded into the table, which can add extra delay before the data is available.
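
To check whether rows are still sitting in that buffer, you can inspect the table metadata. A sketch using the google-cloud-bigquery client library (assuming a reasonably recent client version and default credentials; this is a diagnostic aid, not part of the Beam pipeline):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('sw.payload')  # project inferred from credentials

buf = table.streaming_buffer  # None once the buffer has been flushed
if buf is not None:
    print('estimated rows in streaming buffer:', buf.estimated_rows)
    print('oldest entry time:', buf.oldest_entry_time)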

FWIW, 1-2 hours sounds like far too long a delay.


Check out an interesting blog post on the Life of a Streaming Insert

1 vote

I'd like to add some data as context:

I'm currently streaming Pub/Sub->Dataflow->BigQuery, and the delays are minimal.

SELECT CURRENT_TIMESTAMP(), MAX(ts)
  , TIMESTAMP_MILLIS(CAST(JSON_EXTRACT(ARRAY_AGG(message ORDER BY ts DESC LIMIT 1)[OFFSET(0)], '$.mtime') AS INT64))
FROM `pubsub.meetup_201811` 

2019-03-08 23:29:59.050310 UTC
2019-03-08 23:29:57.620443 UTC
2019-03-08 23:29:57.504 UTC

Let's see:

  • 23:29:57.504 - the original message time, as set by the source.
  • 23:29:57.620443 - timestamp added by the script that reads from the source and pushes to Pub/Sub
  • 23:29:59.050310 - current time

This shows less than 2 seconds from my script to BigQuery.

Let me run that query again:

2019-03-08 23:36:48.264672 UTC
2019-03-08 23:36:47.020180 UTC
2019-03-08 23:36:46.912 UTC

Here we see less than 1.2 seconds between the script and the query.

And a third time:

2019-03-08 23:40:13.327028 UTC
2019-03-08 23:40:12.428090 UTC
2019-03-08 23:40:12.255 UTC

1.1 seconds.

Note my setup for this pipeline:

  • Plain Pub/Sub.
  • Dataflow to BigQuery provided by GCP's template (Java).
  • Somehow Dataflow reports higher latency for the pipeline than what we actually observe.


0 votes

It seems to work now as expected. It looks like there were some intermittent issues: the data was stuck in the BQ streaming buffer for hours and was also un-queryable with a normal SELECT statement.