8 votes

I am streaming data into a BigQuery table.

  • Delete the old table
  • Create a new table with same name and same schema
  • Stream data into new table

I had done this many times before and it worked fine, but recently the approach above stopped working reliably.

After streaming is done (no error reported), I query the table. Sometimes it works; sometimes I get an empty table. (Same script, same data, run many times; the results differ. Sometimes it works, sometimes not.)

And to add to the mystery: when I streamed a large amount of data, it seemed to work most of the time, but when I streamed a small amount of data, it failed most of the time.

But if I just do

  • Create a new table
  • Stream data into the new table

It always works.

I tried this in both Google Apps Script and the PHP Google Cloud Client Library for BigQuery, and had the same problems.

So I tried this in Google Apps Script

  • Delete the old table
  • Sleep 10 seconds, so the delete should have completed
  • Create a new table with the same name and schema
  • Sleep 10 seconds, so the create should have completed
  • Stream data into the new table
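As an aside, fixed sleeps only guess at when the delete/create have settled; the usual alternative is to poll a condition instead. The author reports that even explicit sleeps did not help here, so this is only a sketch of the polling idea, not a fix. The helper below is pure Python; the predicate is an assumption (in practice it might wrap an existence check such as `client.get_table` from the google-cloud-bigquery client):

```python
import time


def wait_for(predicate, timeout=30.0, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Tiny demo with an in-memory flag standing in for "table exists".
state = {"created": True}
print(wait_for(lambda: state["created"], timeout=1.0, interval=0.1))  # True
```

Even with polling, BigQuery's streaming consistency window (described in the accepted answer's docs quote) can still swallow inserts, which is why polling alone is not sufficient.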

It still gave me the same problems.

But no errors were reported or logged.

Additional Information:

I tried again.

If I wait until the stream buffer is empty and then run the script, the results are always correct: the new data is streamed into the new table successfully.

But if I run the script right after the previous run, the results are empty; the data is not streamed into the new table.

So the error seems to happen when I delete the old table and create the new table while the stream buffer is not empty.

But according to the answer in this thread, BigQuery Stream and Delete while streaming buffer is not empty?,

the old table and the new table (even with the same name and same schema) have two different "object ids"; they are actually two different tables. After I delete the old table, the old records in the stream buffer should be dropped too. Whether the stream buffer is empty or not should not affect my next steps: creating a new table and streaming new data into the new table.

On the other hand, if I try to "truncate the old table" instead of "delete the old table and create a new table" while there might still be data in the stream buffer, then a "DML statement cannot modify data still in stream buffer", so the truncate would fail.

In simple words, in this use case:

  • I cannot truncate the old table, because the stream buffer may not be empty.
  • I am supposed to "delete the old table, create a new table, then stream data to the new table". But that seems to be the root of my current problem: my new data is not streamed into the new table (even though the new table has a new object id and should not be affected by the fact that I just deleted the old table).
You asked a similar question previously in stackoverflow.com/q/48331407/6253347, and Graham shared cloud.google.com/blog/products/gcp/…, which explains what you are seeing. You should truncate the table rather than deleting and recreating it with the same schema... Elliott Brossard
@ElliottBrossard, my understanding of the answer from stackoverflow.com/q/48331407/6253347 is the opposite. I thought I should delete the old table and create a new table: because the object ids are different, they are actually two different tables. "1) Correct. The "object id" is different and remaining records will be dropped." So if I drop the old table, the remaining buffered records for the old table will be dropped too; they will not stream into the new table. Then I stream new data to the new table, which has a new object id, so the new records will be streamed into the new table. searain
@ElliottBrossard "2) Kinds of correct. DML statement cannot modify data still in stream buffer. However, the statement will fail if it tries to touch rows still in stream buffer." If I truncate the table while some records are still in the stream buffer, the statement will fail. So my understanding is that I should delete the old table and create a new one, instead of truncating the old table. searain

2 Answers

3 votes

Avoid truncating and recreating tables while streaming.

From the official docs:

https://cloud.google.com/bigquery/troubleshooting-errors#streaming

Table Creation/Deletion - Streaming to a nonexistent table will return a variation of a notFound response. Creating the table in response may not immediately be recognized by subsequent streaming inserts. Similarly, deleting and/or recreating a table may create a period of time where streaming inserts are effectively delivered to the old table and will not be present in the newly created table.

Table Truncation - Truncating a table's data (e.g. via a query job that uses writeDisposition of WRITE_TRUNCATE) may similarly cause subsequent inserts during the consistency period to be dropped.

To avoid losing data: Create a new table with a different name.
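One way to follow that advice is to derive a fresh table name on every run, for example with a timestamp plus a random suffix, and stream into that table instead of recreating one under the old name. A minimal sketch in Python; the naming scheme and helper name are my own illustration, not anything BigQuery prescribes (table IDs may contain letters, digits, and underscores):

```python
import uuid
from datetime import datetime, timezone


def unique_table_name(base: str) -> str:
    """Build a fresh table name per run so streaming inserts never hit a
    just-deleted/recreated table. Uses only characters BigQuery allows
    in table IDs (letters, digits, underscores)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{base}_{stamp}_{uuid.uuid4().hex[:8]}"


print(unique_table_name("events"))  # e.g. events_20240101_120000_1a2b3c4d
```

Downstream queries can then target the newest table (or a view repointed at it), and old tables can be garbage-collected later, outside the streaming path.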

1 vote

I posted about this in another thread of mine regarding streaming into BigQuery. As a rule, I now try to avoid streaming when I can:

  • Load the data to Cloud Storage
  • Then load data from Cloud Storage to BigQuery

This approach avoids many streaming-related issues.
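For the load-job path, BigQuery accepts newline-delimited JSON from Cloud Storage. A small sketch of preparing rows in that format; the upload to Cloud Storage and the load-job call itself (e.g. the client's `load_table_from_uri` with `NEWLINE_DELIMITED_JSON` as the source format) are omitted, and the helper name is my own:

```python
import json


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON, one object per line,
    the format BigQuery load jobs accept for JSON sources."""
    return "\n".join(json.dumps(row, separators=(",", ":")) for row in rows) + "\n"


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
print(to_ndjson(rows))
```

Unlike streaming inserts, a load job is atomic: either all rows land in the table or the job fails with an error, which sidesteps the consistency window described in the question.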