5
votes

I'm getting a schema mismatch error when querying parquet data from Athena.

The error is:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields

In the AWS Glue Crawler I tried enabling Update all new and existing partitions with metadata from the table which I thought would resolve this issue, however I'm still getting the above error. I did this because of the similar question: How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

The table schema for the totals column is:

struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>

The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".

1
is your parquet data partitioned based on eventDate column? If yes then that would not be a part of schema in athena table. Instead, while creating the table pass eventDate column in partition by parameter.Harsh Bafna
@HarshBafna Yes it is partitioned on eventDate. The table is created by a Glue crawler. The DDL for the table contains PARTITIONED BY ('eventdate' string)AndrewSC
looks like s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet file has only 12 columns in its schema as compared to 13 defined in the athena table. Use github.com/apache/parquet-mr/tree/master/parquet-tools to find the schema of that file.Harsh Bafna
please provide the output of "parquet-tools schema <your_parquet_file>" command.Harsh Bafna
@HarshBafna I don't have parquet-tools built at the moment. I have however used a Spark DataFrame.printSchema() to display the schema:AndrewSC

1 Answers

1
votes

You have parquet files with two different schema and the Athena table schema matches with the newer one. You can do one of the following :

1) Create two different tables in athena, one pointing to data till 2017 and other pointing to data post 2017.

2) In case the older data is no more valid for current use case, then you can simply archive that data and remove the 2017 and older partitions from your current table.