
I am trying to load Parquet data into Google BigQuery, to take advantage of the efficient columnar format and also (I hope) to get around BigQuery's lack of support for logical types (DATE, etc.) in Avro files.

My data contain two levels of nested arrays.

Using JSON I can create and load the table with the desired structure:

bq mk temp.simple_interval simple_interval_bigquery_schema.json
bq load --source_format=NEWLINE_DELIMITED_JSON temp.simple_interval ~/Desktop/simple_interval.json
bq show temp.simple_interval

   Last modified                    Schema                   Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- ---------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:21:56   |- file_name: string (required)          3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (repeated)
                    |  |- interval_date: date (required)
                    |  |- quality: string (required)
                    |  +- values: record (repeated)
                    |  |  |- interval: integer (required)
                    |  |  |- value: float (required)
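
For reference, simple_interval_bigquery_schema.json looks something like this (reconstructed here from the bq show output above; nothing beyond that output is assumed):

[
  {"name": "file_name", "type": "STRING", "mode": "REQUIRED"},
  {"name": "file_created", "type": "TIMESTAMP", "mode": "REQUIRED"},
  {"name": "id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "interval_length", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "days", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "interval_date", "type": "DATE", "mode": "REQUIRED"},
    {"name": "quality", "type": "STRING", "mode": "REQUIRED"},
    {"name": "values", "type": "RECORD", "mode": "REPEATED", "fields": [
      {"name": "interval", "type": "INTEGER", "mode": "REQUIRED"},
      {"name": "value", "type": "FLOAT", "mode": "REQUIRED"}
    ]}
  ]}
]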

I have tried to create the same structure with a Parquet data file using AvroParquetWriter. My Avro schema is:

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {
      "type": "array",
      "items": {
        "name": "days_record",
        "type": "record",
        "fields": [
          {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
          {"name": "quality", "type": "string"},
          {"name": "values", "type": {
            "type": "array",
            "items": {
              "name": "values_record",
              "type": "record",
              "fields": [
                {"name": "interval", "type": "int"},
                {"name": "value", "type": "float"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}

From the Avro specification, and from what I have found online, it seems it is necessary to nest 'record' nodes inside 'array' nodes in this way.
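
For context, the Parquet file is written along these lines (a minimal sketch of AvroParquetWriter usage; file names and record-building details are illustrative, not from my actual code):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.File;

public class SimpleIntervalWriter {
    public static void main(String[] args) throws Exception {
        // Parse the Avro schema shown above (file name is illustrative)
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("simple_interval.parquet"))
                .withSchema(schema)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("file_name", "simple_interval.csv");
            // ... populate file_created, id, interval_length, and the nested
            // 'days'/'values' arrays (e.g. as GenericData.Array values) ...
            writer.write(record);
        }
    }
}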

When I create my Parquet file, Parquet tools reports the schema as:

message simple_interval {
  required binary file_name (UTF8);
  required int64 file_created (TIMESTAMP_MILLIS);
  required binary id (UTF8);
  required int32 interval_length;
  required group days (LIST) {
    repeated group array {
      required int32 interval_date (DATE);
      required binary quality (UTF8);
      required group values (LIST) {
        repeated group array {
          required int32 interval;
          required float value;
        }
      }
    }
  }
}

I load the file into BigQuery and examine the results:

bq load --source_format=PARQUET temp.simple_interval ~/Desktop/simple_interval.parquet
bq show temp.simple_interval

   Last modified                      Schema                      Total Rows   Total Bytes   Expiration   Time Partitioning   Labels
 ----------------- --------------------------------------------- ------------ ------------- ------------ ------------------- --------
  09 May 13:05:54   |- file_name: string (required)               3            246
                    |- file_created: timestamp (required)
                    |- id: string (required)
                    |- interval_length: integer (required)
                    +- days: record (required)
                    |  +- array: record (repeated)           <-- extra column
                    |  |  |- interval_date: date (required)
                    |  |  |- quality: string (required)
                    |  |  +- values: record (required)
                    |  |  |  +- array: record (repeated)     <-- extra column
                    |  |  |  |  |- interval: integer (required)
                    |  |  |  |  |- value: float (required)

This is workable, but I was wondering: is there a way to avoid the extra intermediate 'array' nodes/columns?

Have I missed something? Is there a way with Avro/Parquet to obtain the simpler BigQuery table structure that JSON gives for nested arrays?

Don't know Parquet, but FYI logical types in Avro will be supported in BQ soon: issuetracker.google.com/issues/35905894 – Hua Zhang

Thanks for the pointer about Avro logical types. I look forward to that. However, given Parquet's columnar nature, it seems a better fit for BigQuery. For simple test data I generated (with quite a lot of duplication in columns, which is typical of my data), Avro was about 25% of the size of JSON, while Parquet was less than 1% of the size of JSON! – John Hurst

I noticed another fact about this problem, which seems to indicate it is particular to Parquet, or perhaps to AvroParquetWriter. When I write an Avro file using this Avro schema and serialise it to JSON using Avro tools, I get back the simple JSON I wanted (no 'array' node). When I write a Parquet file using AvroParquetWriter and the same Avro schema, and serialise it back to JSON using Parquet tools, I get the extra 'array' nodes. Perhaps there is some option in AvroParquetWriter to control this. I will look into it. – John Hurst

If you were able to solve your problem, can you post an answer and accept it please? – VictorGGl

1 Answer


I used this Avro schema (note that it replaces the 'array' levels with plain nested records, so 'days' and 'values' are no longer repeated):

{
  "name": "simple_interval",
  "type": "record",
  "fields": [
    {"name": "file_name", "type": "string"},
    {"name": "file_created", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "id", "type": "string"},
    {"name": "interval_length", "type": "int"},
    {"name": "days", "type": {"type": "record", "name": "days_", "fields": [
      {"name": "interval_date", "type": {"type": "int", "logicalType": "date"}},
      {"name": "quality", "type": "string"},
      {"name": "values", "type": {"type": "record", "name": "values_", "fields": [
        {"name": "interval", "type": "int"},
        {"name": "value", "type": "float"}
      ]}}
    ]}}
  ]
}

I created an empty Avro file from it and ran the command:

bq load --source_format=AVRO <dataset>.<table-name> <avro-file>.avro 
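
(For the record, one way to create an empty Avro file from a schema is with Avro's Java DataFileWriter; a sketch, with illustrative file names:)

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class EmptyAvroFile {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("simple_interval.avsc"));
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // Create the container file carrying the schema, appending no records
            writer.create(schema, new File("simple_interval.avro"));
        }
    }
}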

When running bq show <dataset>.<table-name> I get the following:

 Last modified                    Schema                    Total Rows   Total Bytes   Expiration   Time Partitioning   Labels   kmsKeyName  
 ----------------- ----------------------------------------- ------------ ------------- ------------ ------------------- -------- ------------ 
  22 May 09:46:02   |- file_name: string (required)           0            0                                                                   
                    |- file_created: integer (required)                                                                                        
                    |- id: string (required)                                                                                                   
                    |- interval_length: integer (required)                                                                                     
                    +- days: record (required)                                                                                                 
                    |  |- interval_date: integer (required)                                                                                    
                    |  |- quality: string (required)                                                                                           
                    |  +- values: record (required)                                                                                            
                    |  |  |- interval: integer (required)                                                                                      
                    |  |  |- value: float (required)
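
As for the extra 'array' nodes themselves, which the comments above speculate about: parquet-avro has a writer property, parquet.avro.write-old-list-structure, which defaults to true and yields the two-level list encoding (repeated group array) shown in the question. Setting it to false should make AvroParquetWriter emit the standard three-level list encoding instead. This is a hedged sketch, not verified against the original data; check the property against your parquet-avro version:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

import java.io.IOException;

public class ThreeLevelListWriter {
    // Open a writer that uses the standard three-level list encoding
    // instead of the legacy two-level 'repeated group array' form.
    public static ParquetWriter<GenericRecord> open(Schema schema, String path) throws IOException {
        Configuration conf = new Configuration();
        conf.setBoolean("parquet.avro.write-old-list-structure", false);
        return AvroParquetWriter
            .<GenericRecord>builder(new Path(path))
            .withSchema(schema)
            .withConf(conf)
            .build();
    }
}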