Update partitioned table schema on AWS Glue/Athena

Question

I have an event stream from Firehose to S3 which then gets crawled by Glue to periodically create new partitions and update the schema for a table to be queried in Athena.

Suddenly our queries started to fail with

HIVE_BAD_DATA: Error parsing field value for field 1: For input string: "11642224428"

So we suspect that one of the fields that Glue initially inferred as INT now has to be changed to BIGINT. For some reason, the crawler didn't do that automatically, so we need to fix it.

The obvious attempt was to edit the table schema from struct<...,field:int,...> to struct<...,field:bigint,...>, but doing so results in:

HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'my_column' in table 'my_table' is declared as type 'struct<...,field:bigint,...>', but partition 'partition_0=2019/partition_1=12/partition_2=31/partition_3=23' declared column 'my_column' as type 'struct<...,field:int,...>'.

So our conclusion is that we would have to update the schema for all partitions by hand as well, but could not find how to do so without scripting it ourselves.

How can we proceed? Can we somehow tell the Crawler to do this for us, or is there some AWS API that could help us finish this cleanly?

If not, would iterating over all partitions doing the schema update solve our problem?

Prabhakar Reddy Prabhakar Reddy · Accepted Answer · 2020-01-23T09:59:52

To fix this issue you need to enable "Update all new and existing partitions with metadata from the table" and rerun the crawler. It will then enforce the same schema as table across all the partitions. Refer to this to know more about this setting.

Update partitioned table schema on AWS Glue/Athena

2 Answers