2 votes

I am trying to crawl some files having different schemas (data-compatible) using AWS Glue.
As per the AWS documentation, Glue crawlers update the catalog tables for any change in the schema (adding new columns and removing missing columns). I checked the "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" options while creating the crawler.
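For reference, those two options correspond to the crawler's SchemaChangePolicy and its CombineCompatibleSchemas grouping setting. A minimal boto3 sketch of the equivalent configuration, with placeholder bucket, database, and role names (these are assumptions, not my real setup):

import json
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-schema-crawler",                                # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="my_database",                               # placeholder database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/input/"}]},
    # "Update the table definition in the Data Catalog"
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
    # "Create a single schema for each S3 path"
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
    ),
)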
Example:
Let's say I have a file "File1.csv" as shown below:

name,age,loc
Ravi,12,Ind
Joe,32,US

Say I have another file "File2.csv" as shown below:

name,age,height
Jack,12,160
Jane,32,180

After the crawler ran, the schema was updated to: name,age,loc,height. This is as expected, but when I tried to read the files using Athena, or tried writing the content of both files to CSV using a Glue ETL job, I observed that the output looks like:

name,age,loc,height
Ravi,12,Ind,
Joe,32,US,
Jack,12,160,
Jane,32,180,

The last two rows should have loc blank, since the second file didn't have a loc column.

Whereas the expected output is:

name,age,loc,height
Ravi,12,Ind,
Joe,32,US,
Jack,12,,160
Jane,32,,180

In short, Glue fills the columns in a contiguous manner in the combined output. Is there any way I can get the expected output?


1 Answer

3 votes

I got the expected output with Parquet files. Initially I was using CSV, but the CSV deserializer doesn't understand how to put the elements into the correct positions when the schema changes. Converting the individual CSVs to Parquet and then crawling them one after another helped me incorporate the changing schema.
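A minimal PySpark sketch of that conversion (the bucket and paths are placeholders; a Glue ETL job exposes the same DataFrame API through its SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read one CSV with its own header-derived columns (name, age, height for File2.csv).
df = spark.read.option("header", "true").csv("s3://my-bucket/input/File2.csv")

# Write Parquet: column names are stored with the data, so readers resolve
# fields by name rather than by position when the table schema later grows.
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/File2/")

Repeat per file (or per folder), then point the crawler at the Parquet prefix; Athena will then resolve loc and height by name instead of by position.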