I am using BigQuery to query an external data source (also known as a federated table), where the source data is a Hive-partitioned Parquet table stored in Google Cloud Storage. I used this guide to define the table.
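For context, the definition step looked roughly like the following sketch using the Python client (the project, bucket, and path names below are placeholders, and I'm assuming the guide's AUTO partition-key detection mode):

    # Minimal sketch of defining the hive-partitioned Parquet external table;
    # names here are placeholders, not my real project/bucket.
    from google.cloud import bigquery

    client = bigquery.Client()

    # External (federated) configuration pointing at the Parquet files in GCS.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://my-bucket/my-table/*"]

    # Infer partition keys (e.g. year, day) from the hive-style directory layout.
    hive_opts = bigquery.HivePartitioningOptions()
    hive_opts.mode = "AUTO"
    hive_opts.source_uri_prefix = "gs://my-bucket/my-table"
    external_config.hive_partitioning = hive_opts

    table = bigquery.Table("my-project.my_dataset.my_table")
    table.external_data_configuration = external_config
    client.create_table(table)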
My first query to test this table looks like the following:
SELECT * FROM my_dataset.my_table WHERE year=2019 AND day = "2019-01-01" LIMIT 10
This query fails with the following error:
Column visitor_partition has type INT64, which differs from the expected type of INT32
I want to know why this error occurs and how to work around or solve it. To investigate, I used the parquet-tools utility to peek inside the internals of my Parquet data.
When I run java -jar ./parquet-tools-1.10.0.jar meta test.c000.gz.parquet | grep visitor_partition over one of my Parquet files, the following relevant lines are returned:
visitor_partition: OPTIONAL INT64 R:0 D:1
visitor_partition: INT64 GZIP DO:0 FPO:59420041 SZ:54561/537912/9.86 VC:633590 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 0, max: 99, num_nulls: 0]
When I run the schema command (java -jar ./parquet-tools-1.10.0.jar schema test.c000.gz.parquet), the following relevant schema info comes up:

optional int64 visitor_partition;
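As a sanity check (assuming pyarrow is available; this is just an alternative to parquet-tools), inspecting the same file from Python reports the same physical type:

    # Equivalent schema check with pyarrow, using the same file as above.
    import pyarrow.parquet as pq

    schema = pq.read_schema("test.c000.gz.parquet")
    print(schema.field("visitor_partition").type)  # prints: int64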
So it's clear that in this Parquet file, the data for the visitor_partition field is represented as an INT64. Why, then, does BigQuery expect to see an INT32 type here?