
We run a nightly query against BigQuery via the Java REST API, specifying a destination table for the results to be pushed to (write disposition = WRITE_TRUNCATE). Today's query appeared to run without errors, but the results were not pushed to the destination table.
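
Roughly, the job is submitted like the sketch below (sketch only -- client setup, project/dataset/table ids and the query string are illustrative placeholders, not our production code):

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;

public class NightlyQueryJob {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project";  // placeholder

    // Standard client setup with application-default credentials.
    GoogleCredential credential = GoogleCredential.getApplicationDefault()
        .createScoped(BigqueryScopes.all());
    Bigquery bigquery = new Bigquery.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("nightly-query")
        .build();

    // Same configuration as the logged JSON below: destination table,
    // CREATE_IF_NEEDED, WRITE_TRUNCATE.
    TableReference destination = new TableReference()
        .setProjectId(projectId)
        .setDatasetId("my_dataset")   // placeholder
        .setTableId("my_table");      // placeholder
    JobConfigurationQuery queryConfig = new JobConfigurationQuery()
        .setQuery("SELECT ...")       // placeholder for the nightly query
        .setDestinationTable(destination)
        .setCreateDisposition("CREATE_IF_NEEDED")
        .setWriteDisposition("WRITE_TRUNCATE");
    Job job = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));

    // jobs.insert starts the query; we then poll jobs.get until the status is DONE.
    Job started = bigquery.jobs().insert(projectId, job).execute();
    System.out.println("Started job " + started.getJobReference().getJobId());
  }
}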

This query has been running for a few weeks now and we've had no issues. No code changes were made either.

Manually running it a second time after it "failed" worked fine. It was just this one glitch that we spotted and we're concerned it may happen again.

Our logged JSON response from the "failed" query looks fine (I've obfuscated any sensitive data):

INFO: Job finished successfully: {
  "configuration" : {
    "dryRun" : false,
    "query" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "[REMOVED]",
        "projectId" : "[REMOVED]",
        "tableId" : "[REMOVED]"
      },
      "priority" : "INTERACTIVE",
      "query" : "[REMOVED]",
      "writeDisposition" : "WRITE_TRUNCATE"
    }
  },
  "etag" : "[REMOVED]",
  "id" : "[REMOVED]",
  "jobReference" : {
    "jobId" : "[REMOVED]",
    "projectId" : "[REMOVED]"
  },
  "kind" : "bigquery#job",
  "selfLink" : "[REMOVED]",
  "statistics" : {
    "creationTime" : "1390435780070",
    "endTime" : "1390435780769",
    "query" : {
      "cacheHit" : false,
      "totalBytesProcessed" : "12546"
    },
    "startTime" : "1390435780245",
    "totalBytesProcessed" : "12546"
  },
  "status" : {
    "state" : "DONE"
  }
}

Using the "try it!" for Jobs/GET here and plugging in the job id also shows the job was indeed successful and matches our logged output (pasted above).

Checking the web console shows the destination table has been truncated but not repopulated. Weirdly, its "Last Modified" timestamp has not been updated either (I did try refreshing the page numerous times):

http://i.stack.imgur.com/384NL.png

Has anyone experienced this before with BigQuery - a query appearing to run successfully, but the results never written to the specified destination table even though the table was truncated?


2 Answers


I am a developer on the BigQuery team. I've looked up the details of your job from the breadcrumbs you left (your query was the only one that started at that start time).

It looks like your destination table was truncated at 4:09 pm today PST, which is the time your job ran, but it was left empty -- the query that truncated it didn't actually fill in any information.

I'm having a little bit of trouble piecing together the details, because one of the source tables appears to have been overwritten (the left table in your left outer join was created at 4:20 PM).

However, there is a clue in the "total bytes processed" field -- it says the query only processed 12K of data. The internal statistics say that only 384 rows were involved across the two tables in the query.

My guess is that the query legitimately returned 0 rows, so the table was cleared.

There is a bug in that deleting all of the data in a table doesn't update the last-modified time. We use last modified to mean either the last time the metadata was updated (description, schema, etc.) or the last time data was added to the table. But if you just truncate the table, that doesn't update the metadata or add data, so we end up with a stale last-modified time.
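
If you want to sanity-check this on your side, one option is to inspect the destination table's metadata right after the job reports DONE -- a rough sketch using the same illustrative Java client and placeholder names as in your question:

// After the query job reports DONE, fetch the destination table's metadata.
Table table = bigquery.tables().get(projectId, "my_dataset", "my_table").execute();
System.out.println("numRows          = " + table.getNumRows());           // 0 would match "truncated but empty"
System.out.println("lastModifiedTime = " + table.getLastModifiedTime());  // can be stale after a truncate, per the bug above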

If this doesn't sound like a reasonable chain of events, we'll need more information from you to debug it (especially since it looks like the tables involved have been modified since you ran this query), and a way for us to reproduce it would be great.


So, we figured out what the problem is with this. It failed again a few times over the last few days so we dug in further.

The query being executed depends on another query which is executed immediately before it. Although we do wait for the first query to finish (job status = "DONE"), it appears that behind the scenes it is not actually fully complete and its data is not yet available to be used.

Current process is:

  1. Fetch data from another data source and stream the results into table A
  2. When (1) is complete (poll the job id until its status is "DONE"), submit another query which joins on the results in table A to create table B
  3. Table A's data is not yet available, so the query from (2) results in an empty table

We've noticed it takes about 5-10 seconds for the data to actually appear and be available in BigQuery when using streaming for the first query.

We used a fairly ugly workaround: simply wait a few seconds after the first query before running the next one. Not exactly elegant, but it works.
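
For completeness, the workaround boils down to something like this (sketch only; the bigquery client, projectId and tableBJob are the illustrative placeholders from the sketches in the question, and the delay value is just illustrative -- anything comfortably above the 5-10 seconds we observed):

// 1. Run the first job (which gets data into table A) and poll jobs.get until DONE.
// 2. Even after DONE, the streamed data can take several seconds to become
//    queryable, so pause before submitting the dependent query.
Thread.sleep(15000L);  // illustrative buffer above the observed 5-10s lag

// 3. Now submit the second query, which joins on table A to build table B
//    (same jobs.insert pattern as the sketch in the question).
bigquery.jobs().insert(projectId, tableBJob).execute();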