I am currently building a data lake where I run AWS Glue jobs daily to copy data from our database and make it queryable via AWS Athena. Because the schema of the data I fetch changes often, I crawl it regularly with a Glue Crawler. Unfortunately, when I run the crawler two days in a row and the schema has changed in between, Athena queries fail with an error about incompatible schemas:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://***/raw/itemstore/parquet_flattened/v1/type=articles/year=2019/month=12/day=12/part-00012-13fc8243-cd4e-47b8-8763-56b15ea46e84-c000.snappy.parquet (offset=0, length=32745292): Schema mismatch, metastore schema for row column item__timeline.element has 10 fields but parquet schema has 9 fields

This query ran against the "***" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: ***

Here is the CloudFormation definition of our crawler:

  ItemStoreCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: <A STRING>
      DatabaseName: !Ref DatabaseName
      Configuration: "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
      Role: !GetAtt CrawlerRole.Arn
      TablePrefix: String 
      Tags:
        Platform: !Ref Platform
        Maintainer: !Ref Maintainer
        ServerType: !Ref ServerType
        ServiceName: !Sub ${ProjectName}
        Environment: !Ref Environment

      Targets:
        S3Targets:
          - Path: String

My guess is that the schema-merging behavior of my crawler is misconfigured in the line starting with Configuration, but I cannot find a fix.

Is your table partitioned? If so, have you tried running the crawler with its settings changed to "update new and existing partitions with table metadata"? - Prabhakar Reddy

1 Answer

This is related to how columns are matched when the schema changes, in particular whether the column order is ignored. I would strongly recommend against using a Glue Crawler here. Instead, define the table directly in Athena, using Glue as your Hive metastore, to avoid this.

https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html#summary-of-updates
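
As a rough illustration of defining the table explicitly rather than letting a crawler infer it, here is a minimal CloudFormation sketch in the same style as the crawler in the question. The resource name, the table name, the DataBucketName parameter, and the two columns are hypothetical placeholders, not taken from the question. The partition keys mirror the type=/year=/month=/day= layout visible in the error message, and their types are assumed to be string.

```yaml
  ItemStoreArticlesTable:                 # hypothetical resource name
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref DatabaseName
      TableInput:
        Name: itemstore_flattened         # hypothetical table name
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: parquet
        PartitionKeys:                    # taken from the S3 path layout; types assumed
          - Name: type
            Type: string
          - Name: year
            Type: string
          - Name: month
            Type: string
          - Name: day
            Type: string
        StorageDescriptor:
          # DataBucketName is an assumed template parameter; the real bucket is elided in the question
          Location: !Sub s3://${DataBucketName}/raw/itemstore/parquet_flattened/v1/
          InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
          Columns:                        # hypothetical columns; list the real schema here
            - Name: item__id
              Type: string
            - Name: item__timeline
              Type: array<struct<event:string,at:timestamp>>
```

With a table defined this way, a schema change becomes an explicit edit to Columns instead of something the crawler infers. New partitions still have to be registered separately, for example with MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION in Athena.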