We have some pipe delimited .txt reports coming into a folder in S3, on which we run Glue crawler to determine the schema and query in Athena.
The format of the report changed recently so there are two new columns in the middle.
Old files:
Columns A B C D E F
Data a1 b1 c1 d1 e1 f1
New files with extra "G" and "H" columns:
Columns A B G H C D E F
Data a2 b2 g2 h2 c2 d2 e2 f2
What we get in the table created by the crawler as seen in Athena:
Columns A B C D E F G H <- Puts new columns at the end. OK
Data a1 b1 c1 d1 e1 f1 <- Correct for old data
Data a2 b2 g2 h2 e2 f2 <- 4 columns incorrect and 2 missing
Is this some sort of bug in glue crawler, or is there a way to configure this so it puts the right data in the right columns (other than runnning a data cleaning script to transform the input files)?