I have a bunch of ORC files in S3 that are partitioned by date, with each date's data placed in its own folder. Recently, however, new columns were added to the ORC files being created.
In other words, the older ORC files may have ColumnA, ColumnB, ColumnC, while the more recent ORC files (after a certain date) have ColumnA, ColumnB, ColumnC, ColumnD, ColumnE.
Because of this, queries against the crawled table in Athena can fail with an error complaining that the type of ColumnD in the ORC file is incompatible with the type in the Athena table.
1. Is there anything I can do in Glue or Athena so that, when it crawls, it knows that the old ORC files with missing ColumnD and ColumnE should default to null, or to whatever value is compatible with the type, so that it doesn't keep throwing errors when queried?
2. If (1) is not possible, what else can I do to get Athena to work on a data set whose schema has changed along the way?
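
To make the behaviour I am after in the first question concrete, here is a rough sketch in plain Python. It is not Glue or Athena code; it only models the idea. The column names are the ones above, while the column types, the sample row, and the helper name `pad_to_schema` are made up for illustration.

```python
# Hypothetical schemas: column names come from the question, types are assumed.
OLD_SCHEMA = {"ColumnA": "string", "ColumnB": "int", "ColumnC": "double"}
NEW_SCHEMA = {**OLD_SCHEMA, "ColumnD": "string", "ColumnE": "int"}


def pad_to_schema(row, schema):
    """Return the row with every column in `schema`, using None (null) for missing ones."""
    return {name: row.get(name) for name in schema}


# A made-up row as it might be read from one of the older files.
old_row = {"ColumnA": "x", "ColumnB": 1, "ColumnC": 2.5}

# Columns that exist only in the newer files.
missing = [name for name in NEW_SCHEMA if name not in OLD_SCHEMA]

padded = pad_to_schema(old_row, NEW_SCHEMA)
print(missing)
print(padded)
```

That is, a row from an older file would come back with ColumnD and ColumnE set to null instead of the query failing.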