Header files for AWS Glue Data Catalog

Question

I have some data in S3 that I want AWS Glue to crawl and store in a Data Catalog. The problem is that the data itself does not have header rows. Instead, there is a separate header file ("header.csv"). Is there a way to tell AWS Glue to use the header.csv file to get the column names? Otherwise, the Data Catalog will show the column names as "col0", "col1", ... "coln".

For example, I have the following data:

s3://bucket/data/animals/header.csv

"id","animaltype","age"

s3://bucket/data/animals/data.csv

"1","cat","5"
"2","dog","2"
"3","otter","7"

Sandeep Fatangare · Accepted Answer · 2019-01-16T07:56:17

I am afraid there is no way for the crawler to take header information from another file.

However, you can write a Glue job to rename the columns:

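from functools import reduce  # reduce is not a builtin in Python 3

df = dyf.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
oldColumns = df.schema.names  # crawler-generated names: col0, col1, ...
# cols from header file, e.g. the names in the question's header.csv
newColumns = ["id", "animaltype", "age"]
df = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df)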
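The "cols from header file" step could be handled roughly as follows. This is only a sketch using the Python standard library: the helper names parse_header and rename_pairs are made up for illustration, and header_text is assumed to already hold the contents of header.csv (how it is read from S3 is left out here).

```python
import csv
import io


def parse_header(header_text):
    """Return the column names from the first row of a header CSV string."""
    reader = csv.reader(io.StringIO(header_text))
    return next(reader)


def rename_pairs(old_columns, new_columns):
    """Pair each generated name (col0, col1, ...) with its real name, by position."""
    if len(old_columns) != len(new_columns):
        raise ValueError(
            "column count mismatch: %d vs %d" % (len(old_columns), len(new_columns))
        )
    return list(zip(old_columns, new_columns))


# Contents of header.csv as shown in the question.
header_text = '"id","animaltype","age"\n'
new_columns = parse_header(header_text)

# Names the crawler would assign to a headerless three-column file.
old_columns = ["col0", "col1", "col2"]
pairs = rename_pairs(old_columns, new_columns)
```

In the Glue job, each (old, new) pair can then be passed to df.withColumnRenamed(old, new). Checking the lengths first matters because renaming by position would silently mislabel columns if the header file and the data ever disagree.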
