I'm working with AWS Glue and many files on S3, with new files appended every day. I tried to create and run a crawler to infer the schema of those CSV files. Instead of producing a single Data Catalog table with one schema, the crawler creates many tables (even with the "Create a single schema for each S3 path" option selected), which means the crawler detects different schemas and can't combine them into one. But I need just one table in the Data Catalog for all those files!
So I created a separate Data Catalog table manually, but when I use that table in a Glue job, none of the S3 CSV files are processed. I guess that is because every time the crawler runs, it checks for new files and partitions (and in the good case of a single-schema table, you can see those files and partitions by clicking the "View partitions" button in Tables).
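To show what I mean, here is roughly how the job reads the table (a minimal sketch; the database and table names are placeholders for my real ones):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read through the manually created Data Catalog table instead of pointing at S3 directly
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder name
    table_name="customer",    # placeholder name
)
print(dyf.count())  # comes back as 0 -- none of the CSV files under the S3 prefix are picked up
```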
So here there is a way to update a manually created table with a crawler. I followed it, hoping the crawler would not change the data types of the columns I selected but would update the list of files and partitions for the Glue job to process later:
You might want to create AWS Glue Data Catalog tables manually and then keep them updated with AWS Glue crawlers. Crawlers running on a schedule can add new partitions and update the tables with any schema changes. This also applies to tables migrated from an Apache Hive metastore.
To do this, when you define a crawler, instead of specifying one or more data stores as the source of a crawl, you specify one or more existing Data Catalog tables. The crawler then crawls the data stores specified by the catalog tables. In this case, no new tables are created; instead, your manually created tables are updated.
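For reference, I configured the crawler roughly like this via boto3 (again a sketch; the role ARN, database, and table names are placeholders):

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="customer-crawler",                               # placeholder name
    Role="arn:aws:iam::123456789012:role/my-glue-role",    # placeholder role
    Targets={
        "CatalogTargets": [
            # existing Data Catalog table as the crawl source
            {"DatabaseName": "my_database", "Tables": ["customer"]}
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        # as I understand it, catalog-target crawlers require LOG here
        "DeleteBehavior": "LOG",
    },
)
```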
For some reason that doesn't happen; in the crawler log I see this:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler (truncated to first 200 files):
bucket1/customer/dt=2020-02-26/delta_20200226_080101.csv
INFO : Multiple tables are found under location bucket1/customer/. Table customer is skipped.
But there is no "Exclude patterns" option to exclude that file when the crawler uses an existing Data Catalog table, and the documentation says that in this case "The crawler then crawls the data stores specified by the catalog tables".
And the crawler doesn't add any partitions or files to my table.
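A quick boto3 check along these lines (sketch with placeholder names) also comes back empty after the crawler finishes:

```python
import boto3

glue = boto3.client("glue")

# List partitions registered on the manually created table
resp = glue.get_partitions(DatabaseName="my_database", TableName="customer")
print(len(resp["Partitions"]))  # prints 0
```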
Is there a way to update my manually created table with new files from S3?