
What I am trying to do is crawl data in an S3 bucket with AWS Glue. The data is stored as nested JSON and the path looks like this:

s3://my-bucket/some_id/some_subfolder/datetime.json

When I run the default crawler (no custom classifiers), it partitions the data based on the path and deserializes the JSON as expected. However, I would also like to get the timestamp from the file name as a separate field; for now the crawler omits it.

For example if I run crawler on:

s3://my-bucket/10001/fromage/2017-10-10.json

I get a table schema like this:

  • Partition 1: 10001
  • Partition 2: fromage
  • Array: JSON data

I did try to add custom classifier based on Grok pattern:

%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}


However, whenever I re-run the crawler it skips the custom classifier and uses the default JSON one. As a workaround I could obviously append the file name to the JSON itself before running the crawler, but I was wondering if I can avoid this step?
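For reference, a minimal sketch of that workaround (the function name and the example payload are hypothetical, and the regex is a simplified stand-in for the Grok pattern above): parse the id, source, and timestamp out of the S3 key and merge them into the JSON body before the crawler runs.

```python
import json
import re

# Simplified regex equivalent of %{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp},
# restricted here to date-only filenames like "2017-10-10.json".
KEY_PATTERN = re.compile(r"^(?P<id>\d+)/(?P<source>\w+)/(?P<timestamp>\d{4}-\d{2}-\d{2})\.json$")

def enrich_record(key: str, body: str) -> str:
    """Extract id/source/timestamp from the S3 key and add them to the JSON payload."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Key does not match expected layout: {key}")
    record = json.loads(body)
    record.update(match.groupdict())  # adds "id", "source", "timestamp" fields
    return json.dumps(record)

# Example with the key from the question (payload is made up):
enriched = enrich_record("10001/fromage/2017-10-10.json", '{"array": [1, 2, 3]}')
```

In practice you would read each object, pass its key and body through something like this, and write the result back before crawling.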


1 Answer


Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:

s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json
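A small sketch of that key rewrite (the function name and filename convention are assumptions, not part of the answer): map an existing key onto the Hive-style `key=value` layout, which the crawler would then pick up as `id`, `source`, and `timestamp` partitions.

```python
import re

def to_partitioned_key(key: str) -> str:
    """Rewrite "10001/fromage/2017-10-10.json" into a Hive-style partitioned key."""
    match = re.match(r"^(\d+)/(\w+)/(\d{4}-\d{2}-\d{2})\.json$", key)
    if match is None:
        raise ValueError(f"Unexpected key layout: {key}")
    record_id, source, ts = match.groups()
    return f"id={record_id}/source={source}/timestamp={ts}/data-file-{ts}.json"

# For the key from the question:
# to_partitioned_key("10001/fromage/2017-10-10.json")
# → "id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json"
```

Copying each object to its rewritten key (and deleting the original) would give the crawler the partitioned layout without touching the JSON contents.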