3
votes

Context: We have a Dataflow job that writes Avro files to GCS with a schema that changes weekly (field additions only). So that means, under a GCS prefix, we have a bunch of Avro files with different schemas, most likely 2 schemas at any given time. For more details, please see the Context section in this post.

The problem: According to this SO post, when loading Avro files with multiple schemas into BigQuery, BigQuery will choose the file with the largest lexico order. However, this is not the behavior I observed. I am observing inconsistent behaviors.

In my first try, my newer schema was picked up and the new fields are there. However, the BQ load itself took much longer than it should. It took 7 minutes to load 368,594 records.

In my second try, the files with larger lexico order are using the new schema, and I am able to open the Avro file and see the new fields in the header. But when I BQ load these files into a table, the added fields are missing. But if I load the file with the largest order individually, the table will have the new fields.

We have a custom file naming policy, which is: "chunk-$windowStart-$windowEnd-shardIndex-of-shardNum-UUID.avro"

Question:

  1. Since BQ does auto schema detection for all Avro files, what exactly is the rule regarding old/new schemas? Especially when only field additions happen?
  2. Why BQ load took so long in my first attempt? Did it load with the old schema, then found out about the new schema halfway, then re-did all the work?
  3. Any suggestions on how I can debug this?
1

1 Answers

0
votes

Google Cloud Support here!

  1. Schema autodetection is an inference process which BigQuery carries out based on a small sample of rows. This means that the inferred schema may vary depending on the sample analysed. This may account for why you have experienced an inconsistent behaviour. For more info, check out this doc

  2. To answer to this question I need more information, so I would encourage you to open a ticket with Google Cloud Support so that we can better assist you.

  3. See answer 2.

I hope that helps.