How to handle Athena when there are schema changes in new data?

Question

I have a bunch of ORC files in S3 which are partition by their dates. The data are placed in folders of their dates in S3. However, recently, there are new columns added to the ORC files created.

In other words, the older ORC files may have ColumnA, ColumnB, ColumnC while the more recent ORC files (after a certain date) have ColumnA, ColumnB, ColumnC, ColumnD, ColumnE.

Because of this, the data Athena crawls can have errors complaining that ColumnD type in ORC is incompatible with the type in Athena table when queried against it.

Is there anything I can do in Glue or Athena so that when it crawls, it knows that the old ORC files with missing ColumnD and ColumnE will default to null or whatever value that is compatible with the type so that it doesn't keep throwing error when queried?
If (1) is not possible, what else can I do to get Athena to work on a set of data which has changes along the way in its schema?

Philipp Johannis Philipp Johannis · Accepted Answer · 2020-11-21T21:42:39

In general Athena supports schema evolution, there is a great overview from AWS on Handling Schema Updates. What's possible is highly dependent on your file format. In your case it is ORC, so Athena reads columns by index by default, so their order is important:

you can't add columns at the beginning or in the middle of the table
you can't remove or reorder columns
you can add columns at the end of the table

So in your case, where you just want to add two columns at the end of the table there shouldn't be any problem and you should be able to simply add these columns by using ALTER TABLE ADD COLUMN.

How to handle Athena when there are schema changes in new data?

1 Answers