Can you please help me read a tar.gz file using a Glue crawler? I have a tar.gz file in S3 that contains a couple of files with different schemas, and when I run a crawler, I don't see the schemas in the Data Catalog. Should we use a custom classifier? The AWS Glue FAQ says gzip is supported via classifiers, but gzip is not listed in the classifiers in the Glue Classifiers section.
2
votes
OK, I see no answers to this, so is the way forward to use a Lambda function to unzip/uncompress the files into a different S3 location and point the Glue crawler at that? I'd appreciate any simpler way, if there is one.
– Yuva
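For what it's worth, the Lambda workaround can be sketched roughly like this: extract each member of the tar.gz and re-compress it individually as plain gzip, so the crawler's built-in classifiers see ordinary `.csv.gz`-style objects. The boto3 calls in the comments and all bucket/key names are illustrative, not from this thread:

```python
import gzip
import io
import tarfile

def extract_members(tar_gz_bytes):
    """Return (member_name, raw_bytes) for every regular file in a tar.gz archive."""
    members = []
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                members.append((member.name, tar.extractfile(member).read()))
    return members

def regzip(data):
    """Re-compress one extracted file as plain gzip, which the built-in
    Glue classifiers (e.g. CSV) can read directly."""
    return gzip.compress(data)

# Inside a Lambda handler you would then do something like (sketch, using boto3):
#   s3 = boto3.client("s3")
#   body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"].read()
#   for name, data in extract_members(body):
#       s3.put_object(Bucket=dst_bucket, Key=f"uncompressed/{name}.gz",
#                     Body=regzip(data))
```

Since the archive members have different schemas, writing each one under its own S3 prefix lets the crawler create a separate table per prefix instead of trying to merge them.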
Do you mean tar? If so, you need to gzip the files individually.
– Steven Ensslen
@Yuva - did you find a solution for this (direct support from Glue) instead of using a Lambda?
– java_dev
3 Answers
3
votes
According to the official AWS docs for Glue crawler built-in classifiers, this functionality should be fully supported and transparent.
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
CSV compressed with gzip is covered by a built-in classifier.
However, I would suggest contacting AWS Support if it does not work as described for you.
0
votes
Did you check whether the crawler can parse the file itself? Create a sample file with a few lines from the original file and run the crawler on it to see if it can infer the schema. If not, you may need a custom classifier. That's especially true for space-separated text files. You can also paste some sample lines here if that's okay with you.
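A quick way to produce such a sample, assuming the original is a plain text file (the file names in the comments are just placeholders):

```python
def take_sample(text, n=10):
    """Return the first n lines of a file's text, for a quick crawler test."""
    return "\n".join(text.splitlines()[:n]) + "\n"

# Hypothetical usage: write a small sample file, then upload it to S3
# and point the crawler at it.
# with open("original.txt") as src:
#     sample = take_sample(src.read(), n=20)
# with open("sample.txt", "w") as dst:
#     dst.write(sample)
```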