Google BigQuery has on March 23, 2016 announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". It says here "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes.". However, I'd expect the feature to work.
I didn't find anywhere code examples on how to use Avro format for loading. Neither I did find examples on how to use bq
-tool for loading.
Here's my practical issue. I haven't been able to load data into BigQuery in Avro-format.
The following happens using bq
-tool. The dataset, table name and bucket name have been obfuscated:
$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE
$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.
Basically, I am extracting from a table and inserting to the same table causing an internal error.
Additionally, I have Java program that does the same (extract from table X and load to table X) with the same result (internal error). But I think the above illustrates the problem as clearly as possible, and because of that I'm not sharing the code here. In Java, If I extract from an empty table and insert that, the insert job doesn't fail.
My questions are
- I think BigQuery API should never fail with internal error. Why is that happening with my test?
- Is the extracted Avro file compatible with an insert job?
- There seems to be no specification what the Avro schema in an insert job is like, at least I couldn't find any. Could the documentation be created?
UPDATED 2016-04-25:
So far I've managed to get an Avro load job not to give an internal error based on the hint of not using REQUIRED
fields. However, I haven't managed to load non-null values.
Consider this Avro-schema:
{
"type": "record",
"name": "root",
"fields": [
{
"name": "x",
"type": "string"
}
]
}
The BigQuery table has one column, x
that is NULLABLE
.
If I insert N (I've tried with one and two) rows (x
being e.g. 1
), I got N rows in BigQuery but x
always having value null
.
If I change the table so that X
is REQUIRED
I get an internal error.