Can you please help me read a tar.gz file using a Glue crawler? I have a tar.gz file in S3 that contains a couple of files with different schemas, and when I run a crawler, I don't see the schemas in the Data Catalog. Should we use a custom classifier? The AWS Glue FAQ says gzip is supported via classifiers, but gzip is not listed in the classifiers in the Glue Classifiers section.
2
votes
OK, I see no answers to this, so is the way forward to use a Lambda function to unzip/uncompress the files into a different S3 location and point the Glue crawler at that? I'd appreciate any simpler way, if there is one.
– Yuva
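For what it's worth, the Lambda workaround can be sketched roughly like this: extract each member of the tar.gz and re-compress it individually as plain gzip, so the crawler's built-in classifiers see ordinary `.csv.gz`-style objects. The boto3 calls in the comments and all bucket/key names are illustrative, not from this thread:

```python
import gzip
import io
import tarfile

def extract_members(tar_gz_bytes):
    """Return (member_name, raw_bytes) for every regular file in a tar.gz archive."""
    members = []
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                members.append((member.name, tar.extractfile(member).read()))
    return members

def regzip(data):
    """Re-compress one extracted file as plain gzip, which the built-in
    Glue classifiers (e.g. CSV) can read directly."""
    return gzip.compress(data)

# Inside a Lambda handler you would then do something like (sketch, using boto3):
#   s3 = boto3.client("s3")
#   body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
#   for name, data in extract_members(body):
#       s3.put_object(Bucket=dst_bucket, Key=f"uncompressed/{name}.gz",
#                     Body=regzip(data))
```

Since the archive members have different schemas, writing each one under its own S3 prefix lets the crawler create a separate table per prefix instead of trying to merge them.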
Do you mean tar? If so, you need to gzip the files individually.
– Steven Ensslen
@Yuva - did you find a solution for this (direct support from Glue) instead of using a Lambda?
– java_dev
3 Answers
3
votes
According to the official AWS docs for Glue crawler built-in classifiers, this functionality should be fully supported and transparent.
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
CSV compressed with gzip is covered by a built-in classifier.
However, I would suggest contacting AWS Support if it does not work as described for you.
0
votes
Did you check whether the crawler can parse the file itself? Create a sample file with a few lines from the original file and run the crawler on it to see if it can infer the schema. If not, you may need a custom classifier. That's especially true for space-separated text files. You can also paste some sample lines here if that's okay with you.
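A quick way to produce such a sample, assuming the original is a plain text file (the file names in the comments are just placeholders):

```python
def take_sample(text, n=10):
    """Return the first n lines of a file's text, for a quick crawler test."""
    return "\n".join(text.splitlines()[:n]) + "\n"

# Hypothetical usage: write a small sample file, then upload it to S3
# and point the crawler at it.
# with open("original.txt") as src:
#     sample = take_sample(src.read(), n=20)
# with open("sample.txt", "w") as dst:
#     dst.write(sample)
```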