Due to user error, an S3 directory that a Glue crawler routinely crawls became flooded with .csv files. When Glue ran over the directory, it created a table for each of the 200,000+ .csv files. Shortly after, I ran a script that deleted the .csv files (the S3 bucket has versioning enabled) and re-ran the Glue crawler with the following settings:
- Schema updates in the data store: Update the table definition in the data catalog.
- Inherit schema from table: Update all new and existing partitions with metadata from the table.
- Object deletion in the data store: Delete tables and partitions from the data catalog.
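One thing worth noting: because the bucket is versioned, my cleanup script only left delete markers rather than physically removing the objects. My understanding is the crawler should still treat those keys as deleted, but here's a rough sketch of how I confirmed the markers are actually there (the bucket and prefix names are placeholders for my real location):

```python
def count_latest_csv_delete_markers(page):
    """Count .csv keys whose latest version is a delete marker,
    given one page of a list_object_versions response."""
    return sum(
        1
        for m in page.get("DeleteMarkers", [])
        if m["Key"].endswith(".csv") and m.get("IsLatest")
    )


def delete_marker_report(bucket, prefix):
    # Requires AWS credentials; bucket/prefix are placeholders
    import boto3

    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_object_versions").paginate(
        Bucket=bucket, Prefix=prefix
    ):
        total += count_latest_csv_delete_markers(page)
    return total


# Example (placeholder names):
# print(delete_marker_report("my-bucket", "my/prefix/"))
```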
In the CloudWatch logs, I can see it updating the tables that match the remaining data, but it's not deleting any of the tables generated from those .csv files. According to its configuration log in CloudWatch, it should be able to do so:
INFO : Crawler configured with Configuration
{
"Version": 1,
"Grouping": {
"TableGroupingPolicy": "CombineCompatibleSchemas"
}
}
and SchemaChangePolicy
{
"UpdateBehavior": "UPDATE_IN_DATABASE",
"DeleteBehavior": "DELETE_FROM_DATABASE"
}
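To rule out a mismatch between the console settings and what the crawler actually runs with, I also pulled the live crawler definition through the API. A rough sketch (the crawler name is a placeholder):

```python
def has_delete_behavior(crawler_response):
    """Check a glue.get_crawler() response for DELETE_FROM_DATABASE."""
    policy = crawler_response["Crawler"].get("SchemaChangePolicy", {})
    return policy.get("DeleteBehavior") == "DELETE_FROM_DATABASE"


def check_crawler(name):
    # Requires AWS credentials; "name" is the crawler's name
    import boto3

    glue = boto3.client("glue")
    return has_delete_behavior(glue.get_crawler(Name=name))


# Example (placeholder crawler name):
# print(check_crawler("my-crawler"))
```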
I should mention that there is another crawler configured to crawl the same S3 bucket, but it hasn't been run in over a year, so I doubt it's a point of conflict.
I'm stumped as to what the issue could be. As it stands, I can write a script to pattern-match the existing tables and drop those with "csv" in their name, or delete and rebuild the database by having Glue re-crawl S3; but if possible, I'd much rather Glue drop the tables itself after identifying that they no longer point to any files in S3.
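For reference, the fallback script I'm describing would look roughly like this: list the catalog tables, keep the ones with "csv" in the name, and delete them in batches (my understanding is batch_delete_table takes at most 100 names per call). The database name and match pattern are assumptions for my setup:

```python
def pick_csv_tables(table_names, needle="csv"):
    """Filter table names that contain the needle (case-insensitive)."""
    return [t for t in table_names if needle in t.lower()]


def chunked(items, size=100):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


def drop_csv_tables(database):
    # Requires AWS credentials; "database" is the Glue database name
    import boto3

    glue = boto3.client("glue")
    names = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        names.extend(t["Name"] for t in page["TableList"])
    for batch in chunked(pick_csv_tables(names)):
        glue.batch_delete_table(DatabaseName=database, TablesToDelete=batch)


# Example (placeholder database name):
# drop_csv_tables("my_database")
```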