1 vote

I'm familiarizing myself with crawlers in AWS Glue. I imported a database from the Athena catalog and would like to crawl the data locations of these tables daily, so that their partitions are updated automatically when new data is added.

However, my crawlers only seem to create new tables, separate from the ones imported from Athena; they don't update the existing ones. Is there any way to do this? I don't see any mention of it in the docs.


3 Answers

1 vote

You may need to add a custom classifier, whose job is to classify the data into separate tables in the Data Catalog. You are probably relying on the default classifiers, which don't know how to uniquely identify your schema.

What are classifiers: http://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
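For example, a grok classifier can be created and attached to an existing crawler with boto3. A rough sketch only; the classifier name, grok pattern, and crawler name below are placeholders you'd replace with your own:

    import boto3

    glue = boto3.client("glue")

    # Hypothetical classifier for line-oriented log files.
    glue.create_classifier(
        GrokClassifier={
            "Name": "my-log-classifier",
            # "Classification" becomes the table's classification property.
            "Classification": "application_logs",
            "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
        }
    )

    # Attach the classifier to an existing crawler so it is tried
    # before the built-in classifiers.
    glue.update_crawler(Name="my-crawler", Classifiers=["my-log-classifier"])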

1 vote

I have not tested this yet, but try updating the following fields on your imported table:

"CreatedBy": "arn:aws:sts::000000000000:assumed-role/YOUR_CLAWLER_ROLE/AWS-Crawler"
"Parameters": {
        "CrawlerSchemaDeserializerVersion": "1.0",
        "compressionType": "none",
        "UPDATED_BY_CRAWLER": "you_crawler_name_for_this_table",
        "CrawlerSchemaSerializerVersion": "1.0"
    }

I have skipped the properties that are not related to the crawler. The idea is to update your table so that it looks like it was "created by crawler"; maybe after that the crawler will update it. :)

To get the full table definition, use get-table, but keep in mind that its output differs slightly from the input that update-table expects.
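Untested, but here is a minimal boto3 sketch of the idea; the database, table, and crawler names are placeholders. Note that CreatedBy is one of the fields get-table returns but update-table does not accept, so through the API you can only set the Parameters part:

    import boto3

    glue = boto3.client("glue")

    # Hypothetical names -- replace with your own.
    DATABASE, TABLE, CRAWLER = "my_database", "my_table", "my_crawler"

    # Fetch the full table definition (get-table).
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

    # Make the table look like it was "created by crawler".
    params = table.get("Parameters", {})
    params.update({
        "CrawlerSchemaDeserializerVersion": "1.0",
        "CrawlerSchemaSerializerVersion": "1.0",
        "compressionType": "none",
        "UPDATED_BY_CRAWLER": CRAWLER,
    })
    table["Parameters"] = params

    # update-table accepts only a subset of what get-table returns;
    # read-only fields (CreatedBy, CreateTime, UpdateTime, DatabaseName, ...)
    # must be dropped from TableInput.
    ALLOWED = {
        "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
        "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
        "ViewExpandedText", "TableType", "Parameters",
    }
    glue.update_table(
        DatabaseName=DATABASE,
        TableInput={k: v for k, v in table.items() if k in ALLOWED},
    )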

It would be nice if you could post your results, because I cannot try this myself any time soon. :(

Hope it helps.

1 vote

All you have to do is set UPDATED_BY_CRAWLER to the name of your crawler, and the crawler will pick the table up on its next run. Please note that if you have any custom fields defined, they will be removed by the crawler.
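A minimal boto3 sketch of just this change (the database, table, and crawler names are hypothetical); as noted in the other answer, fields returned by get-table that TableInput does not accept have to be stripped before calling update-table:

    import boto3

    glue = boto3.client("glue")

    table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]
    table.setdefault("Parameters", {})["UPDATED_BY_CRAWLER"] = "my_crawler"

    # Keep only the keys that TableInput accepts.
    allowed = {"Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
               "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
               "ViewExpandedText", "TableType", "Parameters"}
    glue.update_table(DatabaseName="my_database",
                      TableInput={k: v for k, v in table.items() if k in allowed})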