1 vote

BigQuery doesn't stream directly into its long-term storage; it first puts records into a write-optimized store and periodically flushes them to the main storage.

I would like to understand the BigQuery streaming buffer better for the following use cases.

1) What happens if I delete a BigQuery table and immediately recreate a table with the same name, while some records are still in the streaming buffer waiting to be flushed into main storage?

For example, suppose I am streaming a million records into BigQuery, and some of them are still sitting in the streaming buffer, waiting to be flushed to BigQuery's main storage.

At this point, I delete the BigQuery table and recreate it with the same name. Would the remaining records in the streaming buffer still be flushed into the newly recreated table, or would they be dropped?

My guess is that the remaining records in the streaming buffer will be dropped: even if I delete the table and recreate one with the same name, the "object id" of the old table and the new table should be different.

Am I correct?

2) What happens if I run a DELETE query that targets records I have just streamed?

As above, suppose I stream 1 million records and some of them are still in the streaming buffer when I issue a DELETE statement that should remove some of the records I just streamed.

But if the records I want to delete are still in the streaming buffer when I send the DELETE statement, the statement will not be able to delete them (they are not in BigQuery's main storage yet), and later on these records will be flushed into main storage. That means my DELETE statement will fail to remove them.

Am I correct? If so, then for my DELETE statement to work, I have to find out whether the streaming buffer is empty before issuing it? That makes things more complicated.
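One way to "find out if the streaming buffer is empty" is to inspect the table metadata: the `tables.get` API reports streaming-buffer statistics while rows are buffered. A minimal sketch, assuming the google-cloud-bigquery client library (the table and project names below are hypothetical):

```python
from types import SimpleNamespace  # only used to illustrate the helper below


def buffer_is_empty(table) -> bool:
    """True when tables.get reports no streaming buffer on the table.

    `table` is expected to be a google.cloud.bigquery Table, whose
    `streaming_buffer` attribute is None once all buffered rows have
    been flushed; any object with that attribute works here.
    """
    return table.streaming_buffer is None


# Usage with the real client (requires google-cloud-bigquery and credentials;
# the table id is a hypothetical example):
# from google.cloud import bigquery
# client = bigquery.Client()
# table = client.get_table("my-project.my_dataset.my_table")
# if buffer_is_empty(table):
#     client.query(
#         "DELETE FROM `my-project.my_dataset.my_table` WHERE id < 100"
#     ).result()
```

Polling this before each DELETE is racy (new rows can arrive between the check and the statement), so it is a mitigation rather than a guarantee.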

Thanks!

This is really too broad, and there are too many questions rolled into one. I'd suggest reading this great blog post: cloud.google.com/blog/big-data/2017/06/… - Graham Polley
Thanks Graham. I got the answer from YY below. But if anyone is interested in the deeper understanding of how Google BigQuery Streaming works, that was a good article. - searain

2 Answers

3 votes

1) Correct. The "object id" is different, and the remaining records will be dropped.

2) Kind of correct. A DML statement cannot modify data that is still in the streaming buffer; the statement will fail if it tries to touch rows that are still in the buffer.

1 vote

My experience:

If you don't have to use streaming inserts into BigQuery, don't use them.

These days, in most cases I save the data to Google Cloud Storage and then load it from Cloud Storage into BigQuery.

For example, where I used to grab data from MySQL and stream it into BigQuery, I now:

  • Move data from MySQL to Cloud Storage
  • Then load data from Cloud Storage into BigQuery

This avoids a lot of the issues I ran into before when I was streaming into BigQuery.
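The second step above can be done with a BigQuery load job, which writes straight to main storage and so avoids the streaming buffer entirely. A sketch assuming the google-cloud-bigquery client library; the bucket, export path, and table names are hypothetical:

```python
def gcs_uri(bucket: str, blob_name: str) -> str:
    """Build the gs:// URI that a BigQuery load job reads from."""
    return f"gs://{bucket}/{blob_name}"


# Usage (requires google-cloud-bigquery and credentials; names are
# hypothetical examples, not from the original post):
# from google.cloud import bigquery
# client = bigquery.Client()
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.CSV,
#     skip_leading_rows=1,   # skip the CSV header row
#     autodetect=True,       # infer the schema from the file
# )
# load_job = client.load_table_from_uri(
#     gcs_uri("my-bucket", "mysql-export/users.csv"),
#     "my-project.my_dataset.users",
#     job_config=job_config,
# )
# load_job.result()  # loaded rows are immediately visible to DML
```

Because loaded rows never pass through the streaming buffer, a DELETE issued after `load_job.result()` returns will see all of them.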