I'm working with AWS Glue and many files on S3, with new files appended every day. I tried to create and run a crawler to infer the schema of those CSV files. Instead of producing a single Data Catalog table with one schema, the crawler creates many tables (even with the "Create a single schema for each S3 path" option selected), which means the crawler detects different schemas and can't combine them into one. But I need just one table in the Data Catalog for all those files!
So I created a separate Data Catalog table manually, but when I use that table in a Glue job, none of the S3 CSV files are processed. I guess that is because every time the crawler runs, it checks for new files and partitions (and in the good case of a single-schema table, you can see those files and partitions by clicking the "View partitions" button in Tables).
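To show what I mean, here is roughly how the job reads the table (a minimal sketch; the database and table names are placeholders for my real ones):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read through the manually created Data Catalog table instead of pointing at S3 directly
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder name
    table_name="customer",    # placeholder name
)
print(dyf.count())  # comes back as 0 -- none of the CSV files under the S3 prefix are picked up
```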
So here there is a way to update a manually created table with a crawler. I followed it, hoping the crawler would not change the data types of the columns I selected but would update the list of files and partitions for the Glue job to process later:
You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.
To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
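For reference, I configured the crawler roughly like this via boto3 (again a sketch; the role ARN, database, and table names are placeholders):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="customer-crawler",                               # placeholder name
    Role="arn:aws:iam::123456789012:role/my-glue-role",    # placeholder role
    Targets={
        "CatalogTargets": [
            # existing Data Catalog table as the crawl source
            {"DatabaseName": "my_database", "Tables": ["customer"]}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        # as I understand it, catalog-target crawlers require LOG here
        "DeleteBehavior": "LOG",
    },
)
```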
For some reason that doesn't happen; in the crawler log I see this:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket1/customer/dt=2020-02-26/delta_20200226_080101.csv
INFO : Multiple tables are found under location bucket1/customer/. Table customer is skipped.
But there is no "Exclude patterns" option to exclude that file when the crawler uses an existing Data Catalog table, and the documentation says that in this case "The crawler then crawls the data stores specified by the catalog tables".
And the crawler doesn't add any partitions or files to my table.
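A quick boto3 check along these lines (sketch with placeholder names) also comes back empty after the crawler finishes:

```python
import boto3

glue = boto3.client("glue")

# List partitions registered on the manually created table
resp = glue.get_partitions(DatabaseName="my_database", TableName="customer")
print(len(resp["Partitions"]))  # prints 0
```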
Is there a way to update my manually created table with new files from S3?