
We have an EMR cluster running Hive 0.13.1 (I know how archaic it is, but this cluster has a lot of dependencies, which is why we cannot do away with it). Anyway, cutting to the chase: we processed something like 10 TB of TSV data into Parquet using a different EMR cluster running a recent version of Hive. This was a temporary setup to facilitate the huge one-off processing.

Now we are back on the old EMR cluster to do incremental processing of TSV to Parquet. We use AWS Redshift Spectrum coupled with Glue to query this data. Glue crawls the S3 path where the data resides, giving us a schema to work with.
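For reference, the schema Glue crawled can be pulled programmatically and compared against what the Parquet files actually contain. A minimal sketch with boto3; the region, database, and table names are placeholders:

# Hedged sketch: print the column schema that the Glue crawler produced.
# "my_glue_db" and "my_table" are placeholder names.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

table = glue.get_table(DatabaseName="my_glue_db", Name="my_table")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])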

Now the data that was processed by the old EMR cluster gives us "has an incompatible Parquet schema" errors.

The error we get when a query reads Parquet data that mixes files processed by the newer Hive and the old Hive is:

[2018-08-13 09:40:36] error: S3 Query Exception (Fetch)
[2018-08-13 09:40:36] code: 15001
[2018-08-13 09:40:36] context: Task failed due to an internal error. File '<Some s3 path >/parquet/<Some table name>/caldate=2018080900/8e71ebbe-b398-483c-bda0-81db6f848d42-000000' has an incompatible Parquet schema for column
[2018-08-13 09:40:36] query: 11500732
[2018-08-13 09:40:36] location: dory_util.cpp:724
[2018-08-13 09:40:36] process: query1_703_11500732 [pid=5384]
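To confirm which column diverges, you can dump and diff the footer schema of one file from each cluster. A minimal sketch using pyarrow; the paths are placeholders for locally downloaded copies of one old-Hive file and one new-Hive file:

# Hedged sketch: compare the Parquet footer schemas of a file written by
# the newer Hive cluster and one written by Hive 0.13.1.
# Both file paths are placeholders.
import pyarrow.parquet as pq

new_schema = pq.read_schema("new_hive_file.parquet")
old_schema = pq.read_schema("old_hive_file.parquet")

print(new_schema)
print(old_schema)
print("schemas match:", new_schema.equals(old_schema))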

My hunch is that this is caused by the different Hive versions, but it could also be a Redshift Spectrum bug.

Has anyone faced the same issue?


1 Answer


I think this particular post will help you solve this issue. It talks about the problems that arise when a schema is written by one version and read by another:

https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69
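If reading the post alone doesn't unblock you, one common workaround (a sketch under assumptions, not a tested fix for your setup) is to rewrite the old partitions through Spark with explicit casts, so every file carries the same Parquet types that the Glue/Spectrum schema expects. The paths and column names below are placeholders:

# Hedged sketch: re-read a partition produced by Hive 0.13.1 and rewrite
# it with explicit casts so its Parquet schema matches the newer files.
# Paths and the cast list are placeholders for the real table definition.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("parquet-schema-fix").getOrCreate()

df = spark.read.parquet("s3://bucket/parquet/table/caldate=2018080900/")

# Cast columns to the types the Glue/Spectrum schema expects.
fixed = df.select(
    col("id").cast("bigint"),
    col("event_name").cast("string"),
)

fixed.write.mode("overwrite").parquet(
    "s3://bucket/parquet_fixed/table/caldate=2018080900/"
)

Once the rewritten partitions validate against Spectrum, you can point the Glue crawler (or the table location) at the new path.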