Google BigQuery cannot read some ORC data

Question

I'm trying to load ORC data files stored in GCS into BigQuery via bq load/bq mk and facing an error below. The data files copied via hadoop discp command from on-prem cluster's Hive instance version 1.2. Most of the orc-files are loaded successfully, but few are not. There is no problem when I read this data from Hive.

Command I used:

$ bq load --source_format ORC hadoop_migration.pm hive/part-v006-o000-r-00000_a_17 

Upload complete. 

Waiting on bqjob_r7233761202886bd8_00000175f4b18a74_1 ... (1s) Current status: DONE 

BigQuery error in load operation: Error processing job '<project>-af9bd5f6:bqjob_r7233761202886bd8_00000175f4b18a74_1': Error while reading data, error message: 

The Apache Orc library failed to parse metadata of stripes with error: failed to open /usr/share/zoneinfo/GMT-00:00 - No such file or directory

Indeed, there is no such file and I believe it shouldn't be.

Google doesn't know about this error message but I've found similar problem here: https://issues.apache.org/jira/browse/ARROW-4966. There is a workaround for on-prem servers of creating sym-link to /usr/share/zoneinfo/GMT-00:00. But I'm in a Cloud.

Additionally, I found that if I extract data from orc file via orc-tools into json format I'm able to load that json file into BigQuery. So I suspect that the problem not in the data itself.

Does anybody came across such problem?

Valerii Liutyi Valerii Liutyi · Accepted Answer · 2020-12-07T09:18:30

Official Google support position below. In short BigQuery doesn't understand some timezone's description and we suggested to change it in the data. Our workaround for this was to convert ORC data to parquet and then load it into table.

Indeed this error can happen. Also when you try to execute a query from the BigQuery Cloud Console such as:

select timestamp('2020-01-01 00:00:00 GMT-00:00')

you’ll get the same error. It is not just related to the ORC import, it’s how BigQuery understands timestamps. BigQuery supports a wide range of representations as described in [1]. So:

“2020-01-01 00:00:00 GMT-00:00”  -- incorrect timestamp string literal
“2020-01-01 00:00:00 abcdef”     -- incorrect timestamp string literal
“2020-01-01 00:00:00-00:00”      -- correct timestamp string literal

In your case the problem is with the representation of the time zone within the ORC file. I suppose it was generated that way by some external system. If you were able to get the “GMT-00:00” string with preceding space replaced with just “-00:00” that would be the correct name of the time zone. Can you change the configuration of the system which generated the file into having a proper time zone string?

Creating a symlink is only masking the problem and not solving it properly. In case of BigQuery it is not possible.

Best regards, Google Cloud Support

[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#time_zones

Google BigQuery cannot read some ORC data

1 Answers

Best regards, Google Cloud Support