I am trying to load Parquet data into Google BigQuery, both to take advantage of the efficient columnar format and (I hope) to work around BigQuery's lack of support for logical types (DATE etc.) in Avro files.
My data contain two levels of nested arrays.
Using JSON I can create and load the table with the desired structure:
bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels
----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
09 May 13:21:56 |- file_name: string (required) 3 246
|- file_created: timestamp (required)
|- id: string (required)
|- interval_length: integer (required)
+- days: record (repeated)
| |- interval_date: date (required)
| |- quality: string (required)
| +- values: record (repeated)
| | |- interval: integer (required)
| | |- value: float (required)
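For reference, simple_interval_bigquery_schema.json is that same structure expressed in BigQuery's JSON schema format, essentially:

[
  {"name": "file_name", "type": "STRING", "mode": "REQUIRED"},
  {"name": "file_created", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "interval_length", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "days", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "interval_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "quality", "type": "STRING", "mode": "REQUIRED"},
    {"name": "values", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "interval", "type": "INTEGER", "mode": "REQUIRED"},
      {"name": "value", "type": "FLOAT", "mode": "REQUIRED"}
    ]}
  ]}
]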
I have tried to create the same structure with a Parquet data file written using AvroParquetWriter. My Avro schema is:
{
    "name": "simple_interval",
    "type": "record",
    "fields": [
        {"name": "file_name", "type": "string"},
        {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "id", "type": "string"},
        {"name": "interval_length", "type": "int"},
        {"name": "days", "type": {
            "type": "array",
            "items": {
                "name": "days_record",
                "type": "record",
                "fields": [
                    {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
                    {"name": "quality", "type": "string"},
                    {"name": "values", "type": {
                        "type": "array",
                        "items": {
                            "name": "values_record",
                            "type": "record",
                            "fields": [
                                {"name": "interval", "type": "int"},
                                {"name": "value", "type": "float"}
                            ]
                        }
                    }}
                ]
            }
        }}
    ]
}
From the Avro specification, and from what I have found online, it seems that it is necessary to nest 'record' nodes inside 'array' nodes in this way.
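The writer setup itself is essentially the following (a simplified sketch using GenericRecord, with placeholder file names and values):

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class SimpleIntervalWriter {
    public static void main(String[] args) throws IOException {
        // The Avro schema shown above, saved to a file (hypothetical path)
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));
        Schema daysSchema = schema.getField("days").schema().getElementType();
        Schema valuesSchema = daysSchema.getField("values").schema().getElementType();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("simple_interval.parquet"))
                .withSchema(schema)
                .build()) {

            // One inner values_record
            GenericRecord value = new GenericData.Record(valuesSchema);
            value.put("interval", 1);
            value.put("value", 1.5f);

            // One days_record wrapping the inner array
            GenericRecord day = new GenericData.Record(daysSchema);
            day.put("interval_date", 17660);          // days since epoch (logical DATE)
            day.put("quality", "GOOD");
            day.put("values", Arrays.asList(value));

            // Top-level record
            GenericRecord row = new GenericData.Record(schema);
            row.put("file_name", "example.txt");
            row.put("file_created", System.currentTimeMillis()); // timestamp-millis
            row.put("id", "id-1");
            row.put("interval_length", 30);
            row.put("days", Arrays.asList(day));

            writer.write(row);
        }
    }
}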
When I create my Parquet file, parquet-tools reports the schema as:
message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}
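(That dump comes from the parquet-tools CLI, for example:)

parquet-tools schema ~/Desktop/simple_interval.parquet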
I load the file into BigQuery and examine the results:
bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Labels
----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
09 May 13:05:54 |- file_name: string (required) 3 246
|- file_created: timestamp (required)
|- id: string (required)
|- interval_length: integer (required)
+- days: record (required)
| +- array: record (repeated) <-- extra column
| | |- interval_date: date (required)
| | |- quality: string (required)
| | +- values: record (required)
| | | +- array: record (repeated) <-- extra column
| | | | |- interval: integer (required)
| | | | |- value: float (required)
This is workable, but I was wondering: is there a way to avoid the extra 'array' intermediate nodes/columns?
Have I missed something? Is there a way with Avro/Parquet to obtain the simpler BigQuery table structure for nested arrays, as I get with JSON?
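For context on why the extra nodes are a nuisance: queries against the Parquet-loaded table have to step through them explicitly, along the lines of the sketch below (BigQuery standard SQL, with backticks to be safe around reserved words), whereas the JSON-loaded table only needs UNNEST(t.days) and UNNEST(d.values).

SELECT
  t.id,
  d.interval_date,
  v.`interval`,
  v.value
FROM temp.simple_interval AS t,
  UNNEST(t.days.`array`) AS d,
  UNNEST(d.`values`.`array`) AS v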