0
votes

We have some pipe delimited .txt reports coming into a folder in S3, on which we run Glue crawler to determine the schema and query in Athena.

The format of the report changed recently so there are two new columns in the middle.

Old files:

Columns A  B  C  D  E  F
Data    a1 b1 c1 d1 e1 f1

New files with extra "G" and "H" columns:

Columns A  B  G  H  C  D  E  F
Data    a2 b2 g2 h2 c2 d2 e2 f2

What we get in the table created by the crawler as seen in Athena:

Columns A  B  C  D  E  F  G  H    <- Puts new columns at the end. OK
Data    a1 b1 c1 d1 e1 f1         <- Correct for old data
Data    a2 b2 g2 h2       e2 f2   <- 4 columns incorrect and 2 missing

Is this some sort of bug in glue crawler, or is there a way to configure this so it puts the right data in the right columns (other than runnning a data cleaning script to transform the input files)?

1

1 Answers

1
votes

I think this is yet another case of Glue overpromising and underdelivering. As long as the data format is delimited text Glue will do the wrong thing if you add columns in the middle. Adding or removing (but not both) columns to the end works, but not in the middle. Athena does not support different columns for different partitions, so there is no way that Glue could make this work – but it makes it look like it can.

You will either have to rewrite the data, change to add the columns last, or switch to a different data format where the files contain enough metadata for this not to be a problem: JSON, Avro, or Parquet.

I would suggest you stop using Glue crawlers altogether, it looks like it's a general tool, but really solves few use cases. See https://stackoverflow.com/a/56439429/1109 for some suggestions on what to do instead.