4
votes

BigQuery generally does a good job of loading Avro data, but "bq load" is having a lot of trouble with timestamps and other date/time fields that use the Avro logicalType attribute.

  1. Values with Avro type timestamp-millis get mangled: BigQuery TIMESTAMP interprets them as microsecond timestamps, so they end up off by a factor of 1000.
  2. A timestamp-micros integer that loads fine into TIMESTAMP becomes INVALID when loaded into a BigQuery DATETIME. I can't find an explanation of what would be valid at https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
  3. Strings in ISO 8601 format can't be loaded into TIMESTAMP or DATETIME (Incompatible types error), although I believe BigQuery would accept them if I were loading plain JSON.
  4. The Avro "date" type fails to load into DATE (also Incompatible types).

I guess I could work around these problems by always loading the data into temporary fields and using queries to CAST or transform them into additional fields, but that doesn't scale, doesn't support schema evolution, and doesn't stream nicely. Producing data in Avro with well-defined schemas is supposed to avoid that extra step of transforming the data again for different consumers.

Is BigQuery really this incompatible with Avro dates and times? (or am I doing something dumb)

Or is "bq load" the problem here? Is there a better way to load Avro data?

Having a sample file could help debug this problem (if it's a bug, post it here: code.google.com/p/google-bigquery/issues/list). – Felipe Hoffa
I don't know if it's a bug or if BigQuery just doesn't support loading Avro data into the TIMESTAMP, DATETIME, and DATE data types. cloud.google.com/bigquery/data-formats#avro_format doesn't mention any of those three types. I'm not sure how to attach the binary Avro data file here. – Kevin
This feature is now supported; follow issuetracker.google.com/35905894 for more information. – Victor Mota

4 Answers

3
votes

Native support for Avro Logical Types is now publicly available to all BigQuery users. Please refer to the documentation page for more details: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#logical_types
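
For example, here is a minimal sketch of enabling this from the google-cloud-bigquery Python client (the bucket, dataset, and table names are placeholders, not from this thread):

from google.cloud import bigquery

# Placeholder names -- replace with your own project, dataset, table, and GCS path.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    use_avro_logical_types=True,  # convert Avro logical types to TIMESTAMP/DATETIME/DATE
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.avro",
    "my-project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish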

8
votes

Update: This feature is now supported; follow issuetracker.google.com/35905894 for more information.

As Hua said, Avro Logical Types are not supported in BigQuery, but the supported way to load Avro data with timestamps is to use the LONG Avro type to load data into an existing BigQuery table that has a TIMESTAMP column. The value should be in microseconds (not seconds or milliseconds) since the epoch. For example, the Avro file below has a LONG field with the value 1408452095000000, which represents "2014-08-19 12:41:35".

The Avro file's schema:

% avro-tools getschema ~/dataset/simple_timestamp.avro
{
  "type" : "record",
  "name" : "FullName",
  "fields" : [ {
    "name" : "t",
    "type" : "long"
  } ]
}
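
If you need to produce such a file yourself, a small Python sketch (not part of the original answer) using the Avro Python library could look like this; the schema and value match the example above:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Plain "long" field -- no logicalType -- holding microseconds since the Unix epoch.
schema = avro.schema.parse("""
{"type": "record", "name": "FullName",
 "fields": [{"name": "t", "type": "long"}]}
""")

writer = DataFileWriter(open("simple_timestamp.avro", "wb"), DatumWriter(), schema)
writer.append({"t": 1408452095000000})  # 2014-08-19 12:41:35 UTC, in microseconds
writer.close()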

Example of loading an Avro file to a table with a Timestamp field:

bq mk --schema t:TIMESTAMP -t vimota.simple_timestamp
bq load --source_format=AVRO vimota.simple_timestamp ~/dataset/simple_timestamp.avro
bq head vimota.simple_timestamp:

+---------------------+
|          t          |
+---------------------+
| 2014-08-19 12:41:35 |
+---------------------+
2
votes

I have data in a PostgreSQL table with a TIMESTAMP column. I was able to import it into BigQuery via Avro after following the advice in the comments on https://github.com/spotify/spark-bigquery/issues/19.

Using the PostgreSQL JDBC library from Kotlin, I converted the Timestamp into BigQuery's internal format (microseconds since the beginning of the Unix epoch):

// java.sql.Timestamp.time is milliseconds since the epoch; BigQuery TIMESTAMP expects microseconds
(`object` as java.sql.Timestamp).time * 1000

and put it into my Avro record in a field of type Schema.Type.LONG.

Then I created a JSON schema file for my data, in which I gave the column the type "timestamp":

[ {"name": "job", "type": "string", "mode": "required"}, ... {"name": "began", "type": "timestamp", "mode": "required"}, ... ]

(see the began field)

Finally, I imported it into BigQuery with

bq load --source_format=AVRO test.test2 dataset.avro schema.json

The result is

$ bq head test.test2
+------+----+----------+---------------------+---------+-----------+
| job  | id | duration |        began        | status  | node_name |
+------+----+----------+---------------------+---------+-----------+
| job1 |  1 |      0.0 | 2012-04-01 00:00:00 | aStatus | aNodeName |
| job2 |  1 |      0.0 | 2020-02-02 00:02:02 | aStatus | aNodeName |
+------+----+----------+---------------------+---------+-----------+

The Web UI does not allow specifying a schema for an Avro file, but the CLI client and the API do.
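
As an illustration of the API route, a sketch with the google-cloud-bigquery Python client (it reuses the field names from the table above and is not the exact code used here):

from google.cloud import bigquery

# Supply the schema programmatically instead of through the Web UI.
# Only two of the fields from the table above are shown, for brevity.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    schema=[
        bigquery.SchemaField("job", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("began", "TIMESTAMP", mode="REQUIRED"),
    ],
)
with open("dataset.avro", "rb") as f:
    client.load_table_from_file(f, "test.test2", job_config=job_config).result()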

The only problem I still have with this is dealing with time zones. But that is not a problem with Avro.

1
votes

We had the same issue and figured it out. The Avro schema that works for importing date data into an existing BQ table field of type DATE is below. The integer value MUST be the number of days since the epoch (not seconds). Note that the schema's type definition uses the nested format.

  1. I created a BQ table with one field named "day", type DATE, mode REQUIRED.
  2. I created an Avro file with the schema below, containing one record with the integer value 18639 (days since the epoch).
  3. I uploaded that Avro file to a bucket in GCS.
  4. I loaded the Avro data into the table using the command below, and it was converted to a BQ DATE when I viewed it in the table:

bq load --source_format AVRO --use_avro_logical_types s1.avro_date gs://bucket_name/bq_date_int_logical_nested.avro

echo "select * from s1.avro_date" | bq query

Waiting on bqjob_r1433d5cfa5eb9a89_00000176f3182f03_1 ... (0s) Current status: DONE   
+------------+
|    day     |
+------------+
| 2021-01-12 |
+------------+

Schema used:

{
  "type" : "record",
  "name" : "bq_date",
  "namespace" : "my.namespace",
  "fields" : [{
      "name" : "day",
      "type" : {
        "type" : "int",
        "logicalType" : "date"
      }
    } 
  ]
}
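
As a quick sanity check on the days-since-epoch encoding (not part of the original test), the raw int can be computed in Python:

import datetime

# The Avro "date" logical type stores the number of days since 1970-01-01.
epoch = datetime.date(1970, 1, 1)
days = (datetime.date(2021, 1, 12) - epoch).days
print(days)                                      # 18639 -- the int stored in the "day" field
print(epoch + datetime.timedelta(days=18639))    # 2021-01-12, as shown in the query output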

FWIW: we tested files created with both the Avro Python 1.10.1 and Java libraries.