3 votes

I am trying to load a list of Parquet files into a BigQuery table, but I am getting an error:

bq --location=EU load --source_format=PARQUET project:Input.k_2017_11_new "gs://my_bucket/2017_11/11/*.parquet"

Waiting on bqjob_r557b5eb5986df8a0_0000016855915d09_1 ... (34s) Current status: DONE

BigQuery error in load operation: Error processing job 'project:bqjob_r557b5eb5986df8a0_0000016855915d09_1': Error while reading data, error message: incompatible types for field 'data.list.element.p': INT32 in Parquet vs. double in schema

I actually do not need the field that is causing the error, but I cannot find a way to skip this column.

Is there a solution to this problem?

I have tried specifying the schema with a JSON file and forcing this field to FLOAT, INT64, or STRING, but nothing has worked so far.


2 Answers

1 vote

I see you're using Cloud Shell to load Parquet files into BigQuery. Try writing a schema file in JSON, copying or uploading it into your Cloud Shell instance, and passing the file after the source path argument:

bq --location=EU load --source_format=PARQUET project:Input.k_2017_11_new "gs://my_bucket/2017_11/11/*.parquet" ./mySchema.json
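
For reference, the schema file is a JSON array of field definitions. A minimal sketch of what mySchema.json could look like; every field name other than data.p is hypothetical, and note that the list.element wrapper in the error message is Parquet's list encoding, so on the BigQuery side data becomes a REPEATED RECORD containing p:

[
  {"name": "id", "type": "STRING", "mode": "NULLABLE"},
  {
    "name": "data",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {"name": "p", "type": "FLOAT", "mode": "NULLABLE"}
    ]
  }
]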
0 votes

I had a similar issue when using Python, where an additional column was created when trying to write to BigQuery.

The LoadJobConfig ignore_unknown_values parameter fixed my issue; it can also be passed as --ignore_unknown_values on the command line.
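
As a sketch of the Python version, assuming the bucket path and table from the question and the google-cloud-bigquery client:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    ignore_unknown_values=True,  # drop values not represented in the table schema
)

load_job = client.load_table_from_uri(
    "gs://my_bucket/2017_11/11/*.parquet",
    "project.Input.k_2017_11_new",  # table id taken from the question
    job_config=job_config,
)
load_job.result()  # block until the load job completes

Note that ignore_unknown_values only drops extra columns, as in the extra-column case described here; it does not resolve type mismatches, so the INT32-vs-double conflict in the question still needs a matching schema.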