1 vote

On March 23, 2016, Google BigQuery announced "Added support for Avro source format for load operations and as a federated data source in the BigQuery API or command-line tool". The release notes add: "This is a Beta release of Avro format support. This feature is not covered by any SLA or deprecation policy and may be subject to backward-incompatible changes." Still, I'd expect the feature to work.

I couldn't find any code examples of how to use the Avro format for loading, nor any examples of how to load Avro with the bq tool.

Here's my practical issue: I haven't been able to load Avro-format data into BigQuery.

The following happens with the bq tool. The dataset, table name and bucket name have been obfuscated:

$ bq extract --destination_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r62088699049ce969_0000015432b7627a_1 ... (36s) Current status: DONE

$ bq load --source_format=AVRO dataset.events_avro_test gs://BUCKET/events_bq_tool.avro
Waiting on bqjob_r6cefe75ece6073a1_0000015432b83516_1 ... (2s) Current status: DONE
BigQuery error in load operation: Error processing job 'dataset:bqjob_r6cefe75ece6073a1_0000015432b83516_1': An internal error occurred and the request could not be completed.

Basically, I am extracting from a table and loading into the same table, which causes an internal error.

Additionally, I have a Java program that does the same thing (extract from table X and load into table X), with the same result (internal error). But I think the above illustrates the problem as clearly as possible, so I'm not sharing that code here. In Java, if I extract from an empty table and load the result, the load job doesn't fail.
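For reference, the equivalent extract-and-load round trip in Java looks roughly like the sketch below (not my actual program; it uses the google-cloud-bigquery client, and the class names may differ from the library version available at the time):

import com.google.cloud.bigquery.*;

public class AvroRoundTrip {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("dataset", "events_avro_test");
    String uri = "gs://BUCKET/events_bq_tool.avro";

    // Extract the table to GCS in Avro format.
    Job extract = bigquery.create(JobInfo.of(
        ExtractJobConfiguration.newBuilder(table, uri).setFormat("AVRO").build()));
    extract.waitFor();

    // Load the same file back into the same table.
    Job load = bigquery.create(JobInfo.of(
        LoadJobConfiguration.newBuilder(table, uri)
            .setFormatOptions(FormatOptions.avro())
            .build()));
    load = load.waitFor();
    if (load.getStatus().getError() != null) {
      System.out.println(load.getStatus().getError());
    }
  }
}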

My questions are:

  • I think the BigQuery API should never fail with an internal error. Why is it happening with my test?
  • Is the extracted Avro file compatible with an insert job?
  • There seems to be no specification of what the Avro schema for an insert job should look like; at least I couldn't find one. Could that documentation be created?

UPDATED 2016-04-25:

So far, based on the hint about not using REQUIRED fields, I've managed to get an Avro load job not to give an internal error. However, I haven't managed to load non-null values.

Consider this Avro schema:

{ "type": "record", "name": "root", "fields": [ { "name": "x", "type": "string" } ] }

The BigQuery table has one column, x, which is NULLABLE.

If I load N rows (I've tried with one and two, with x being e.g. "1"), I get N rows in BigQuery, but x always has the value null.

If I change the table so that x is REQUIRED, I get an internal error.
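For completeness, the test file was produced roughly like this with the Apache Avro Java API (a sketch; the output file name is illustrative):

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class WriteTestAvro {
  public static void main(String[] args) throws IOException {
    String schemaJson =
        "{\"type\": \"record\", \"name\": \"root\", "
            + "\"fields\": [{\"name\": \"x\", \"type\": \"string\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // One record with a non-null value for x; after the load, x shows up as null.
    GenericRecord row = new GenericData.Record(schema);
    row.put("x", "1");

    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("events_avro_test.avro"));
      writer.append(row);
    }
  }
}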


3 Answers

0 votes

There is no exact match from a BQ schema to an Avro schema, and vice versa, so when you export a BQ table to an Avro file and then import it back, the schema will be different. I see the destination table of your load already exists; in that case we throw an error when the schema of the destination table doesn't match the schema we converted from the Avro schema. This should be an external error, though; we're investigating why it's reported as an internal error.

We're in the middle of upgrading the export pipeline, and the new import pipeline has a bug that doesn't work with the Avro files exported by the current pipeline. The fix should be deployed in a couple of weeks. After that, if you import the exported file into a non-existent destination table, or into a destination table with a compatible schema, it should work. Meanwhile, importing your own Avro files should work. You can also query the file directly from GCS without importing it.

0 votes

There's a problem with the error mapping for the AVRO reader here. The error should have been along the lines of: "The reference schema differs from the existing data: The required field 'api_key' is missing"

Looking at your load job configuration, it includes REQUIRED fields. It sounds like some of the data you are trying to load doesn't specify these required fields, so the operation fails.

I suggest avoiding required fields.
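For example, the destination table could declare its column as NULLABLE rather than REQUIRED. A minimal sketch with the google-cloud-bigquery Java client (class names assumed from the current client; the dataset, table and column names are taken from the question):

import com.google.cloud.bigquery.*;

public class CreateNullableTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // One STRING column, NULLABLE instead of REQUIRED.
    Schema schema = Schema.of(
        Field.newBuilder("x", LegacySQLTypeName.STRING)
            .setMode(Field.Mode.NULLABLE)
            .build());
    bigquery.create(TableInfo.of(
        TableId.of("dataset", "events_avro_test"),
        StandardTableDefinition.of(schema)));
  }
}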

0 votes

So, there's a bug in BigQuery: a load job using the Avro format does not work if the destination table exists; it gives an internal error instead. The workaround is to use createDisposition CREATE_IF_NEEDED and not have a pre-existing table there. I verified that this works.
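For reference, the working load looks roughly like this with the google-cloud-bigquery Java client (a sketch; class names are from the current client and may differ from the version I used, and the destination table must not exist before the job runs):

import com.google.cloud.bigquery.*;

public class AvroLoadWorkaround {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    // The destination table must NOT exist; CREATE_IF_NEEDED lets the load job create it.
    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(
                TableId.of("dataset", "events_avro_test"),
                "gs://BUCKET/events_bq_tool.avro")
            .setFormatOptions(FormatOptions.avro())
            .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
            .build();
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    System.out.println("Load finished: " + job.getStatus().getState());
  }
}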

Hua Zung's comment says that the fix "should be deployed in a couple weeks". Needless to say, existing major bugs in the live system should be documented somewhere.

While updating the system, I really recommend improving the Avro documentation. Currently there's no mention of what the Avro schema should look like (type record, name root, and a fields array containing the columns(?)), nor even of the fact that each record in the Avro file maps to a row in the destination table (obvious, but should be mentioned). What happens on a schema mismatch is also not documented.

Thanks for the help. I'll now be switching to the Avro format; it's so much better than CSV.