
What I am trying to do is crawl data in an S3 bucket with AWS Glue. The data is stored as nested JSON and the path looks like this:

s3://my-bucket/some_id/some_subfolder/datetime.json

When I run the default crawler (no custom classifiers), it partitions the data based on the path and deserializes the JSON as expected. However, I would also like to get the timestamp from the file name as a separate field; for now the crawler omits it.

For example if I run crawler on:

s3://my-bucket/10001/fromage/2017-10-10.json

I get a table schema like this:

  • Partition 1: 10001
  • Partition 2: fromage
  • Array: JSON data

I did try to add custom classifier based on Grok pattern:

%{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp}


However, whenever I re-run the crawler it skips the custom classifier and uses the default JSON one. As a workaround I could obviously append the file name to the JSON itself before running the crawler, but I was wondering if I can avoid this step?
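For reference, a minimal sketch of that workaround (the function name and the example payload are hypothetical, and the regex is a simplified stand-in for the Grok pattern above): parse the id, source, and timestamp out of the S3 key and merge them into the JSON body before the crawler runs.

```python
import json
import re

# Simplified regex equivalent of %{INT:id}/%{WORD:source}/%{TIMESTAMP_ISO8601:timestamp},
# restricted here to date-only filenames like "2017-10-10.json".
KEY_PATTERN = re.compile(r"^(?P<id>\d+)/(?P<source>\w+)/(?P<timestamp>\d{4}-\d{2}-\d{2})\.json$")

def enrich_record(key: str, body: str) -> str:
    """Extract id/source/timestamp from the S3 key and add them to the JSON payload."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"Key does not match expected layout: {key}")
    record = json.loads(body)
    record.update(match.groupdict())  # adds "id", "source", "timestamp" fields
    return json.dumps(record)

# Example with the key from the question (payload is made up):
enriched = enrich_record("10001/fromage/2017-10-10.json", '{"array": [1, 2, 3]}')
```

In practice you would read each object, pass its key and body through something like this, and write the result back before crawling.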


1 Answer


Classifiers only analyze the data within the file, not the filename itself. What you want to do is not possible today. If you can change the path where the files land, you could add the date as another partition:

s3://my-bucket/id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json
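A small sketch of that key rewrite (the function name and filename convention are assumptions, not part of the answer): map an existing key onto the Hive-style `key=value` layout, which the crawler would then pick up as `id`, `source`, and `timestamp` partitions.

```python
import re

def to_partitioned_key(key: str) -> str:
    """Rewrite "10001/fromage/2017-10-10.json" into a Hive-style partitioned key."""
    match = re.match(r"^(\d+)/(\w+)/(\d{4}-\d{2}-\d{2})\.json$", key)
    if match is None:
        raise ValueError(f"Unexpected key layout: {key}")
    record_id, source, ts = match.groups()
    return f"id={record_id}/source={source}/timestamp={ts}/data-file-{ts}.json"

# For the key from the question:
# to_partitioned_key("10001/fromage/2017-10-10.json")
# → "id=10001/source=fromage/timestamp=2017-10-10/data-file-2017-10-10.json"
```

Copying each object to its rewritten key (and deleting the original) would give the crawler the partitioned layout without touching the JSON contents.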