
I ask because, while loading a BigQuery table from parquet files, I'm getting an error that makes me think BigQuery is reading the mode of some fields incorrectly.

I'm attempting to load Parquet files from Cloud Storage into BigQuery from Cloud Shell:

loc1=gs://our-data/thisTable/model=firstmodel

bq --location=US load --noreplace --source_format=PARQUET our-data:theSchema.theTable $loc1/*.parquet ./ourSchema.json

There are ~30 parquet files in the directory referenced by loc1. I get an error that points to one of those specific files:

    BigQuery error in load operation: Error processing job 'our-data:bqjob_re73397ea395b9fd_0000016ae66ab746_1': Error while reading
    data, error message: Provided schema is not compatible with the file 'part-00000-20b9e343-460b-44a8-b083-4437284d6771.c000.snappy.parquet'.
    Field 'dataend' is specified as NULLABLE in provided schema which does not match REQUIRED as specified in the file.

However, when I access the parquet file through Spark and run printSchema(), the field comes up as NULLABLE:

    root
     |-- row_id: long (nullable = true)
     |-- row_name: string (nullable = true)
     |-- dataend: string (nullable = true)
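For context, the check looks roughly like this in PySpark (reading the same loc1 prefix as above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the same prefix passed to bq load; a single part file
    # from the error message works the same way.
    df = spark.read.parquet("gs://our-data/thisTable/model=firstmodel")
    df.printSchema()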

The schema on the BigQuery table is also NULLABLE, as is the appropriate section of the schema JSON:

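The relevant entries in ourSchema.json look roughly like this (an illustrative reconstruction, not the verbatim file):

    [
      {"name": "row_id", "type": "INTEGER", "mode": "NULLABLE"},
      {"name": "row_name", "type": "STRING", "mode": "NULLABLE"},
      {"name": "dataend", "type": "STRING", "mode": "NULLABLE"}
    ]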

I'd appreciate any help knowing where to look next.


1 Answer


When Spark SQL writes Parquet files, it automatically converts all columns to nullable for compatibility reasons, so the schema Spark reports through printSchema() won't necessarily match the REQUIRED/OPTIONAL flags actually stored in the file itself.

You can inspect the parquet file itself using parquet-tools to double-check whether REQUIRED is actually set in the original file.
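For example, using the file name from your error message (assuming parquet-tools is installed and on your path):

    parquet-tools schema part-00000-20b9e343-460b-44a8-b083-4437284d6771.c000.snappy.parquet

If the file really carries the REQUIRED flag, the output would look something like this, with required on the offending column:

    message spark_schema {
      optional int64 row_id;
      optional binary row_name (UTF8);
      required binary dataend (UTF8);
    }

If dataend shows as required there, the file itself is the source of the mismatch, regardless of what printSchema() reports after Spark loads it.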