AWS Glue crawler cannot recognize consistent CSV schema over historical files

Question

We have a folder of .csv and .ctl files. The CSVs are daily files, five per day, accumulated over a period of time. Their naming convention is a prefix string followed by a date identifier (e.g., ABCDE090619.csv). The header row for each of the five daily files is consistent over time.

The expected behaviour of the Glue crawler is to recognize the five table schemas and add each day's data as rows within the matching table. Instead, the crawler creates an individual table schema for every single file, roughly 550 in total.

Is there any mechanism that could be driving this behaviour? We have considered the naming convention, but according to the Glue docs, only the file schema should matter.

Thank you.

sheck97 · Accepted Answer · 2019-09-13T17:23:59

Using the "Create a single schema for each S3 path" option for your crawler might help you. In the Console it's in the Output section of the crawler config under "Grouping behavior for S3 data."
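If you create the crawler through the API rather than the Console, my understanding is that the same option is set through the crawler's `Configuration` JSON, using the `CombineCompatibleSchemas` table grouping policy. Here is a minimal sketch with boto3; the crawler name, role ARN, database name, and S3 path are all placeholders you would supply, and the helper function names are just for illustration.

```python
import json


def build_crawler_args(name, role_arn, database, s3_path):
    """Assemble keyword arguments for boto3's glue.create_crawler.

    All four parameters are placeholders supplied by the caller.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # API-side counterpart of "Create a single schema for each S3 path"
        "Configuration": json.dumps(
            {
                "Version": 1.0,
                "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
            }
        ),
    }


def create_crawler(name, role_arn, database, s3_path):
    """Create the crawler; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so build_crawler_args works without it

    glue = boto3.client("glue")
    return glue.create_crawler(
        **build_crawler_args(name, role_arn, database, s3_path)
    )
```

Keeping the argument construction separate from the API call makes it easy to inspect the configuration before anything is sent to AWS.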

Update: When using the option explained above, you must have files with different schemas separated into different folders. You can point the crawler at the root folder, but the folder structure tells it which files to group together.
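To get from a flat folder to that layout, one approach is to derive a per-prefix folder from each file name and move the object there. The sketch below only computes the new key. It assumes the names are an alphabetic prefix followed by a six-digit date and `.csv` (as in ABCDE090619.csv), and `grouped_key` is a hypothetical helper name, so adjust the pattern to your real prefixes.

```python
import re

# Assumed naming pattern: alphabetic prefix, six-digit date, .csv extension
NAME_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+)(?P<date>\d{6})\.csv$")


def grouped_key(key):
    """Return the key relocated under a folder named after the file prefix.

    Returns None for keys that do not match the assumed pattern
    (for example the .ctl files), so the caller can skip them.
    """
    folder, _, filename = key.rpartition("/")
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = [match.group("prefix"), filename]
    if folder:
        parts.insert(0, folder)
    return "/".join(parts)


if __name__ == "__main__":
    for key in ["data/ABCDE090619.csv", "data/ABCDE090619.ctl"]:
        print(key, "->", grouped_key(key))
```

The actual move could then be done with an S3 copy followed by a delete for each key that maps to a new location. After that, the crawler can be pointed at the root folder and each prefix folder should become one table.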
