
I'm trying to use an AWS Glue crawler to automatically crawl and catalogue JSON files in an S3 bucket, as described here:

https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html

Files smaller than 1 MB are catalogued successfully. Files larger than 1 MB, however, fail to be catalogued and are classified as UNKNOWN.

I have tried the approach listed here: AWS Glue Crawler Classifies json file as UNKNOWN

However, it makes no difference.

Has anyone had similar issues?


1 Answer


I have the same problem. Have you tried converting the data to a flat format such as ORC? There seems to be a limit on the size of nested JSON that the crawler can classify, even with custom classifiers. Alternatively, you can change your JSON from an array of objects

[
   { ... },
   { ... }
]

into one object per line:

{ ... }
{ ... }

This layout, often called JSON Lines, should work in Glue.

This is the Python script I ran to do that transformation (it worked with a 200 MB JSON file):

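import json

# Load the whole file; here the records are the array stored under the "locations" key.
with open('./Data/data.json') as f:
    data = json.load(f)

# Write one JSON object per line.
with open('./Data/data_flat.json', 'w') as out:
    for entry in data['locations']:
        out.write(json.dumps(entry) + '\n')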

Now Glue classifies it correctly!
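
If your records are not under a "locations" key, the same idea can be wrapped in a small helper. This is only a sketch, not anything Glue-specific: the function names `array_to_json_lines` and `is_json_lines` and the optional `key` parameter are my own, and it assumes the file fits in memory (as the 200 MB one did for me). The second function is a quick local check that every line parses as a standalone JSON object before you upload to S3.

```python
import json


def array_to_json_lines(src, dst, key=None):
    """Rewrite a JSON array as one JSON object per line; return the record count.

    `key` is a hypothetical convenience: pass it when the array is nested
    under a top-level key (e.g. "locations"), leave it as None when the
    file itself is a top-level array.
    """
    with open(src, encoding="utf-8") as f:
        data = json.load(f)  # reads the whole file into memory
    records = data[key] if key is not None else data
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    with open(dst, "w", encoding="utf-8") as out:
        for record in records:
            # json.dumps emits no newlines by default, so each record stays on one line
            out.write(json.dumps(record) + "\n")
    return len(records)


def is_json_lines(path):
    """Return True if every non-empty line parses as a JSON object."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                if not isinstance(json.loads(line), dict):
                    return False
            except json.JSONDecodeError:
                return False
    return True
```

For the file above that would be `array_to_json_lines('./Data/data.json', './Data/data_flat.json', key='locations')`, followed by `is_json_lines('./Data/data_flat.json')` as a sanity check.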