I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case:
We have an s3 bucket with a number of Avro files. We have decided to use Avro due to having extensive support for data schema changes overtime, allowing new fields to be applied to old data with no problem.
With AWS Glue, I understand that a new table is created by a crawler whenever there is a schema change. When our schema has changed, this has caused a number of new tables to be created by the crawler, as expected, but not quite as we desire...
Ultimately, we would like the crawler to detect the most recent schema and apply this schema to all the data that we are crawling in the s3 bucket, outputting only one table. We had (perhaps incorrectly) assumed that by using Avro, this would not be an issue as the crawler could apply new schema fields with a given default or null value to older data (the benefit of using Avro), and only output one table that we then could query using AWS Athena.
Is there a way in AWS Glue to use a given schema for all data in the s3 bucket, enabling us to leverage the Avro benefit of schema evolution, so that all data is output into one table?