I'm developing a Python program to use as a Google Dataflow template. What it does is write data from Pub/Sub into BigQuery:
import apache_beam as beam
from apache_beam.options.pipeline_options import StandardOptions

# Streaming mode is needed for the unbounded Pub/Sub source.
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)

(p
 # This is the source of the pipeline.
 | 'Read from PubSub' >> beam.io.ReadFromPubSub('projects/.../topics/...')
 # <Transformation code if needed>
 # Destination: wrap each message in a single-column row.
 | 'String To BigQuery Row' >> beam.Map(lambda s: dict(Trama=s))
 | 'Write to BigQuery' >> beam.io.Write(
     beam.io.BigQuerySink(
         known_args.output,  # table spec parsed from the command line
         schema='Trama:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)))

p.run().wait_until_finish()
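For context, known_args and pipeline_options come from the usual command-line parsing, which I left out above. A rough sketch of that part (only the --output flag is actually used by the snippet; the rest is standard boilerplate):

import argparse

from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
# --output is the only flag the pipeline snippet relies on (known_args.output).
parser.add_argument('--output', required=True,
                    help='BigQuery table spec, e.g. PROJECT:DATASET.TABLE')
known_args, pipeline_args = parser.parse_known_args()

# Any remaining flags (runner, project, temp_location, ...) become pipeline options.
pipeline_options = PipelineOptions(pipeline_args)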
The code is running locally; it is not on Google Dataflow yet.
This "works", but not the way I want: the data currently ends up in the BigQuery streaming buffer and I cannot see it (even after waiting some time). When will it become available in BigQuery? Why is it stored in the streaming buffer instead of the "normal" table?
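For reference, the streaming buffer can also be inspected programmatically; a rough sketch with the google-cloud-bigquery client (the table reference below is a placeholder, not my real one):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table reference; mine is the table the pipeline writes to.
table = client.get_table('my-project.my_dataset.my_table')

# streaming_buffer is None once all rows have been flushed to managed storage.
if table.streaming_buffer is not None:
    print('Estimated rows in the streaming buffer:',
          table.streaming_buffer.estimated_rows)
print('Rows reported by the table metadata:', table.num_rows)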
I have queried the table (the usual LIMIT 1000 preview query), even from C# code instead of the Google Cloud console, but the rows are still missing. I imagine it may be because I am running this pipeline locally, and since it is a streaming pipeline it never finishes, so to end it I have to press the stop button. This is where I can see the difference between stopping and draining it, but I am not sure whether this could be the problem. - IoT user
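The query check mentioned above boils down to something like this (sketched in Python rather than C#, with a placeholder table name):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table; in my case it is the table the pipeline writes to.
rows = list(client.query(
    'SELECT * FROM `my-project.my_dataset.my_table` LIMIT 1000').result())
print('Rows returned:', len(rows))  # comes back empty for me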