3
votes

Due to user error, an S3 directory that a Glue crawler runs over routinely became flooded with .csv files. When Glue ran over the directory, it created a table for each of the 200,000+ .csv files. Shortly afterwards I ran a script that deleted the .csv files (the S3 bucket has versioning enabled) and re-ran the Glue crawler with the following settings:

Schema updates in the data store: Update the table definition in the data catalog.
Inherit schema from table: Update all new and existing partitions with metadata from the table.
Object deletion in the data store: Delete tables and partitions from the data catalog.

Within the CloudWatch logs, the crawler is updating the tables that match the remaining data, but it is not deleting any of the tables generated from those .csv files. According to its configuration log in CloudWatch, it should be able to do so:

INFO : Crawler configured with Configuration 
{
    "Version": 1,
    "Grouping": {
        "TableGroupingPolicy": "CombineCompatibleSchemas"
    }
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE"

I should mention that there is another crawler set to crawl the same S3 bucket, but it hasn't been run in over a year, so I doubt it is a point of conflict.

I'm stumped as to what the issue could be. As it stands, I can write a script that pattern-matches the existing tables and drops those with csv in their name, or I can delete and rebuild the database by having Glue re-crawl S3. If possible, though, I'd much rather have Glue drop the tables itself after identifying that they no longer point to any files in S3.

1
Is deleting those tables gonna be a one-time operation? – Ilya Kisil
What do you mean by a one-time operation? – CorerMaximus

1 Answer

3
votes

I'm currently taking the approach of writing a script to delete the tables created by Athena. All the table names generated from Athena queries are 49 characters long, contain five _ characters for the results table and six for the metadata table, and generally end in _csv for the query results and _csv_metadata for the query metadata.

I'm getting a list of all the table names in my database and filtering it to only include those that are 49 characters long, end with _csv_metadata, and contain six _ characters. I then iterate over each name and delete its corresponding table from the database. To get the matching results table ending in _csv, I cut off the trailing nine characters of the _csv_metadata string, which removes _metadata.
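A rough sketch of that script using boto3 (the database name and the exact length/underscore thresholds come from the pattern described above; verify them against your own generated table names before deleting anything):

```python
def is_athena_metadata_table(name: str) -> bool:
    """Match the generated Athena metadata tables described above:
    49 characters long, ending in _csv_metadata, with six underscores."""
    return (
        len(name) == 49
        and name.endswith("_csv_metadata")
        and name.count("_") == 6
    )


def results_table_for(metadata_name: str) -> str:
    """Cut off the trailing nine characters ("_metadata") to get the
    corresponding _csv results table name."""
    return metadata_name[: -len("_metadata")]


def delete_generated_tables(database: str) -> None:
    """Delete every Athena-generated metadata table in the database,
    along with its matching results table."""
    import boto3  # imported here so the name helpers above need no AWS deps

    glue = boto3.client("glue")
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            name = table["Name"]
            if is_athena_metadata_table(name):
                glue.delete_table(DatabaseName=database, Name=name)
                glue.delete_table(
                    DatabaseName=database, Name=results_table_for(name)
                )
```

Note that `delete_table` only removes the catalog entry; it does not touch any underlying S3 objects.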

If I were to improve on this, I'd also query the table and ensure it has no data in it and matches certain column name definitions.