7
votes

I have an AWS Glue job that reads from a data source like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")

But when I call .toDF() on the dynamic frame, the headers are 'col0', 'col1', 'col2', etc., and my actual headers are in the first row of the dataframe.

Note: I can't set them manually, as the columns in the data source are variable, and iterating over the columns in a loop to set them results in an error because you'd have to reassign the same dataframe variable multiple times, which Glue can't handle.

How might I capture the headers while reading from the data source?

4
What does the table look like in the Glue Catalog? If the underlying DataFrame has generic column names, your catalog entry probably has them too. Did you use a crawler to populate the Catalog? – botchniaque
Just to verify, you can call datasource0.printSchema() and datasource0.toDF().printSchema(), but I doubt they would have different schemas. – botchniaque
Yes, I used a crawler to populate the catalog. In Databases > Tables it does show up with col0, col1, etc. Could the problem be with the crawler? AWS support said to just bypass the data source and consume the CSV straight from the S3 bucket (e.g. step 3 in docs.aws.amazon.com/glue/latest/dg/…), but I don't love that answer. – Tibberzz
Do you have a header row in your CSV? If yes, then it looks like the crawler is not making use of it. If not, then how is the crawler supposed to know what your column names are? – botchniaque

4 Answers

2
votes

You can try the withHeader format option, e.g.

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    'csv',
    {'withHeader': True})

The documentation for this can be found in the AWS Glue format-options reference.
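Outside of Glue, you can see the same behavior locally with Python's stdlib csv module: DictReader, like withHeader=True, treats the first row as the column names (the sample data below is made up for illustration):

```python
import csv
import io

# Made-up sample with a header row, standing in for the real CSV in S3.
sample = "name,email\nAda,ada@example.com\nAlan,alan@example.com\n"

# DictReader consumes the first row as field names, so the remaining
# rows come back keyed by the real headers instead of col0, col1, ...
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["name"])   # Ada
print(list(rows[0]))     # ['name', 'email']
```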

2
votes

I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!

I was using the GUI on AWS and forgot to actually add the correct classifier to the crawler before running it. This resulted in AWS Glue incorrectly detecting datatypes (they mostly came out as strings) and the column names were not detected (they came out as col1, col2, etc). You can create the classifier in "classifiers" under "crawlers". Then, when setting up the crawler, add your classifier to the "selected classifiers" section at the bottom.

Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
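If you prefer to set the classifier up in code rather than the console, boto3's Glue client has a create_classifier call that accepts a CsvClassifier with ContainsHeader='PRESENT'. A minimal sketch (the classifier name here is hypothetical, and actually running the commented call requires AWS credentials):

```python
def csv_classifier_request(name):
    """Build the request body for glue.create_classifier (sketch only)."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": ",",
            # Tell the crawler that row 1 of the CSV is the header.
            "ContainsHeader": "PRESENT",
        }
    }

# import boto3
# boto3.client("glue").create_classifier(**csv_classifier_request("contacts-csv"))
```

Remember to attach the classifier to the crawler and re-run it; an existing table keeps its old col0/col1 schema otherwise.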

1
votes

It turns out it's a bug in the Glue crawler: it doesn't support headers yet. The workaround I used was to go through the motions of crawling the data anyway. When the crawler completes, a Lambda triggers off the crawler-completion CloudWatch event, and that Lambda kicks off a Glue job that just reads directly from S3. When Glue is fixed to support reading headers, I can switch out how I read them in.
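The Lambda side of that workaround can be sketched as below. The event fields follow the "Glue Crawler State Change" CloudWatch event shape, and the job name is hypothetical; verify both against your own events before relying on this:

```python
def handler(event, context=None):
    """Kick off the Glue job when the crawler finishes successfully (sketch)."""
    detail = event.get("detail", {})
    if detail.get("state") != "Succeeded":
        return None  # ignore failed or cancelled crawls
    # In the real Lambda you would start the job here, e.g.:
    # import boto3
    # boto3.client("glue").start_job_run(JobName="read-contacts-from-s3")
    return {"triggered_for": detail.get("crawlerName")}
```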

0
votes

I made a few changes to read with the header, as follows:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://bucketname/key_to_csv_file']},
    format= 'csv',
    format_options= {'withHeader': True})
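If none of the read-time options are available and the headers still land in row 0, a last-resort fallback is to promote that first row to column names after collecting the rows locally. This is a pure-Python sketch with a hypothetical helper name, not a Glue API:

```python
def promote_first_row_to_header(rows):
    """rows: list of lists where rows[0] holds the real column names.
    Returns the remaining rows as dicts keyed by those names."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Beware that this only works if the first row genuinely survives as the first element after collection, which Spark's partitioning does not guarantee in general.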