1
votes

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://exp-mahesh-sandbox/Demo/Year=2017/Month=1/Day=3/part-00015-d0e1263a-616e-435f-b4f4-9154afb3f07d.c000.snappy.parquet (offset=0, length=12795): Schema mismatch, metastore schema for row column statistical has 17 fields but parquet schema has 9 fields

I used an AWS Glue crawler to infer the schema of my Parquet files. Initially I had a few files in the partitions Day=1 and Day=2; I ran the crawler and was able to query them using Athena. After I added a few more files in the partition Day=3, where the "statistical" column (type: struct) is missing some fields, Athena throws the error above. Is there any way to solve this? I expect null values in the missing fields.

I have tried the crawler's "Update the table definition in the data catalog" option, but it gives the same result.

Crawler Settings


1 Answer

2
votes

You're getting that error because at least one of your Parquet files has a schema that differs either from the other files that make up the table or from the table's definition itself; in your case it appears to be the files in your "Day=3" partition.

This is a limitation in Athena: the files that back a table must all have the same schema, i.e. every file's columns must match Athena's table definition, including struct members.
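To make the "even struct members" point concrete, here is a minimal sketch of the comparison that the error message reports: the fields the table definition expects in a struct versus the fields a given file actually carries. The field names and the dict representation are hypothetical (the question does not list the real members of "statistical"), and this is not how Athena itself performs the check.

```python
from typing import Dict, List


def missing_struct_fields(table_struct: Dict[str, str],
                          file_struct: Dict[str, str]) -> List[str]:
    """Return the field names the table expects but the file lacks, in table order."""
    return [name for name in table_struct if name not in file_struct]


# Hypothetical field names, for illustration only. In the question, the
# metastore has 17 fields for "statistical" and the Day=3 file has 9.
table_struct = {"mean": "double", "median": "double", "stddev": "double"}
file_struct = {"mean": "double"}

print(missing_struct_fields(table_struct, file_struct))  # ['median', 'stddev']
```

Any non-empty result for a file means that file is a candidate for the schema mismatch described above.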

This error happens despite the Glue crawler running successfully; the table definition is indeed updated by the crawler, but when you execute a query that touches a file with a different schema (e.g. missing a column) you get a HIVE_CANNOT_OPEN_SPLIT error.
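Since the question hopes for nulls in the missing fields, one possible direction (an assumption on my part, not something Athena does for you) is to rewrite the offending files so that every struct value carries all the fields in the table definition, with the absent ones set to null. The sketch below shows only the padding step on plain Python dicts; `pad_struct` and the field names are hypothetical, and actually reading and rewriting the Parquet files would need a Parquet library, which is not shown.

```python
from typing import Any, Dict, List, Optional


def pad_struct(value: Optional[Dict[str, Any]],
               field_names: List[str]) -> Optional[Dict[str, Any]]:
    """Return a copy of a struct value containing every expected field.

    Fields missing from the input are filled with None (null); a null
    struct stays null. Fields not in field_names are dropped.
    """
    if value is None:
        return None
    return {name: value.get(name) for name in field_names}


# Hypothetical field list taken from the table definition.
expected_fields = ["mean", "median", "stddev"]

row = {"statistical": {"mean": 4.2}}  # a record from a file with fewer fields
row["statistical"] = pad_struct(row["statistical"], expected_fields)
print(row)
```

After padding, every record exposes the same struct members as the table definition, which is the condition the answer says Athena requires.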