I'm investigating a data correctness issue in a regularly running job that I wrote, and the problem seems to be caused by BigQuery overwriting the same table twice in a non-atomic way. Specifically, two copies of the same query ran at the same time (due to retry logic), both set to overwrite the same table with the WRITE_TRUNCATE option, and the resulting table ended up with two copies of every row. I expected one query to write the table with the query results and the other to overwrite it with the same results, not a double-sized table.
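For reference, here's roughly how each job is configured. This is a minimal sketch using the current google-cloud-bigquery Python client (not the API version my job actually used), and the project, table, and query names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Each job writes its results to a fixed destination table with
# WRITE_TRUNCATE, which replaces the table's contents when the job succeeds.
job_config = bigquery.QueryJobConfig(
    destination="my-project.bigbingo_history.example_conversions_2014_04_30_16",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job = client.query(
    "SELECT alternative, bingo_id FROM `my-project.dataset.source_table` "
    "GROUP BY alternative, bingo_id",  # placeholder for the real conversion query
    job_config=job_config,
)
```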
My understanding when designing the system was that all BigQuery actions are atomic (based on "atomic inserts in big query", "Can I safely query a BigQuery table being replaced with WRITE_TRUNCATE", and "Views are failing when their underlying table is repopulated"). Is the issue I'm running into a bug, or am I misunderstanding the exact guarantees I can expect?
Looking back through our job history, this has happened in at least 4 separate cases in the past week.
Here's the timeline of what causes this to happen (with the specific details applying to the most noticeable case):
- At about 18:07 UTC on April 30th, my code submitted 82 queries at the same time. Each one queried a table ending in conversions_2014_04_30_14 (plus one other table) and wrote to a table ending in conversions_2014_04_30_16, specifying WRITE_TRUNCATE.
- About 25 minutes later, 25 of the queries still hadn't finished (more than usual), which triggered my "retry" logic: give up on all queries still running and simply submit them again (sketched below, after this list). This works around an issue I've seen where queries stay pending for hours without being run, which I mentioned here: https://code.google.com/p/google-bigquery/issues/detail?id=83&can=1 . The result was 50 outstanding queries at once: two copies of each of the 25 queries that hadn't finished yet.
- After all queries finished, 6 of the 82 resulting tables were twice as big as they should have been.
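Here's a sketch of that retry logic, again using the current Python client with hypothetical names. The key detail is that the original jobs are not cancelled before the copies are submitted, so two WRITE_TRUNCATE jobs end up racing on the same destination table:

```python
import time
from google.cloud import bigquery

client = bigquery.Client()

def submit(sql, destination):
    """Submit one query job that overwrites `destination`."""
    config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    return client.query(sql, job_config=config)

def run_all_with_retry(queries, timeout_secs=25 * 60):
    """`queries` maps destination table -> SQL. Submit everything, then
    resubmit whatever is still unfinished after `timeout_secs`.

    Note the flaw: the stuck jobs are never cancelled, so the retry
    leaves two jobs targeting the same table at once."""
    jobs = {dest: submit(sql, dest) for dest, sql in queries.items()}
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        for job in jobs.values():
            job.reload()  # refresh job state ("PENDING"/"RUNNING"/"DONE")
        if all(job.state == "DONE" for job in jobs.values()):
            return jobs
        time.sleep(30)
    # Give up on the stragglers and submit a second copy of each.
    for dest, job in list(jobs.items()):
        if job.state != "DONE":
            jobs[dest] = submit(queries[dest], dest)
    for job in jobs.values():
        job.result()  # block until each job finishes
    return jobs
```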
Here's one example:
First query job: 124072386181:job_tzqbfxfLmZv_QMYL6ozlQpWlG5U
Second query job: 124072386181:job_j9_7uJEjtvYbyeVmEVP0u2er9Lk
The resulting table: 124072386181:bigbingo_history.video_task_companions_conversions_2014_04_30_16
And another example:
First query job: 124072386181:job_TQJzGabFT9FtHI05ftTkD5O8KKU
Second query job: 124072386181:job_5hogbjnLX_5a2opEJl9Jacnn53s
Table: 124072386181:bigbingo_history.Item_repetition__Elimination_conversions_2014_04_27_16
The tables haven't been touched since these queries ran (aside from a schema addition to the first table), so they still contain the duplicate rows. One way to confirm this: the queries all had "GROUP BY alternative, bingo_id", yet the tables contain two rows for each (alternative, bingo_id) pair.
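A quick way to run that check, assuming standard SQL and the current Python client (the column names come from the GROUP BY above; the project and table are the ones listed in the first example):

```python
from google.cloud import bigquery

client = bigquery.Client(project="124072386181")

# Every (alternative, bingo_id) pair should be unique, since the original
# query grouped by those two columns; any pair with more than one row is a
# duplicate left behind by the overlapping WRITE_TRUNCATE jobs.
sql = """
SELECT alternative, bingo_id, COUNT(*) AS copies
FROM bigbingo_history.video_task_companions_conversions_2014_04_30_16
GROUP BY alternative, bingo_id
HAVING COUNT(*) > 1
"""
for row in client.query(sql).result():
    print(row.alternative, row.bingo_id, row.copies)
```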