7
votes

I have an AWS Glue job that reads from a data source like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")

But when I call .toDF() on the dynamic frame, the headers are 'col0', 'col1', 'col2', etc., and my actual headers are in the first row of the dataframe.

Note: I can't set them manually, as the columns in the data source are variable, and iterating over the columns in a loop to set them results in an error because you'd have to reassign the same dataframe variable multiple times, which Glue can't handle.

How might I capture the headers while reading from the data source?

4
What does the table look like in the Glue Catalog? If the underlying DataFrame has generic column names, your catalog entry probably has them too. Did you use a crawler to populate the Catalog? – botchniaque
Just to verify, you can call datasource0.printSchema() and datasource0.toDF().printSchema(), but I doubt they would have different schemas. – botchniaque
Yes, I used a crawler to populate the catalog. In Databases > Tables it does show up with col0, col1, etc. Could the problem be with the crawler? AWS support said to just bypass the data source and consume the CSV straight from the S3 bucket (e.g. step 3 in docs.aws.amazon.com/glue/latest/dg/…), but I don't love that answer. – Tibberzz
Do you have a header row in your CSV? If yes, then it looks like the crawler is not making use of it. If not, then how is the crawler supposed to know what your column names are? – botchniaque

4 Answers

2
votes

You can try the withHeader format option, e.g.

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    'csv',
    {'withHeader': True})

The documentation for this can be found in the AWS Glue format-options reference.
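Outside of Glue, you can see the same behavior locally with Python's stdlib csv module: DictReader, like withHeader=True, treats the first row as the column names (the sample data below is made up for illustration):

```python
import csv
import io

# Made-up sample with a header row, standing in for the real CSV in S3.
sample = "name,email\nAda,ada@example.com\nAlan,alan@example.com\n"

# DictReader consumes the first row as field names, so the remaining
# rows come back keyed by the real headers instead of col0, col1, ...
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["name"])   # Ada
print(list(rows[0]))     # ['name', 'email']
```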

2
votes

I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!

I was using the GUI on AWS and forgot to actually add the correct classifier to the crawler before running it. This resulted in AWS Glue incorrectly detecting datatypes (they mostly came out as strings) and the column names were not detected (they came out as col1, col2, etc). You can create the classifier in "classifiers" under "crawlers". Then, when setting up the crawler, add your classifier to the "selected classifiers" section at the bottom.

Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
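If you prefer to set the classifier up in code rather than the console, boto3's Glue client has a create_classifier call that accepts a CsvClassifier with ContainsHeader='PRESENT'. A minimal sketch (the classifier name here is hypothetical, and actually running the commented call requires AWS credentials):

```python
def csv_classifier_request(name):
    """Build the request body for glue.create_classifier (sketch only)."""
    return {
        "CsvClassifier": {
            "Name": name,
            "Delimiter": ",",
            # Tell the crawler that row 1 of the CSV is the header.
            "ContainsHeader": "PRESENT",
        }
    }

# import boto3
# boto3.client("glue").create_classifier(**csv_classifier_request("contacts-csv"))
```

Remember to attach the classifier to the crawler and re-run it; an existing table keeps its old col0/col1 schema otherwise.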

1
votes

It turns out it's a bug in the Glue crawler: it doesn't support headers yet. The workaround I used was to go through the motions of crawling the data anyway. When the crawler completes, a Lambda triggers off the crawler-completion CloudWatch event, and that Lambda kicks off a Glue job that just reads directly from S3. When Glue is fixed to support reading headers, I can switch out how I read them in.
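The Lambda side of that workaround can be sketched as below. The event fields follow the "Glue Crawler State Change" CloudWatch event shape, and the job name is hypothetical; verify both against your own events before relying on this:

```python
def handler(event, context=None):
    """Kick off the Glue job when the crawler finishes successfully (sketch)."""
    detail = event.get("detail", {})
    if detail.get("state") != "Succeeded":
        return None  # ignore failed or cancelled crawls
    # In the real Lambda you would start the job here, e.g.:
    # import boto3
    # boto3.client("glue").start_job_run(JobName="read-contacts-from-s3")
    return {"triggered_for": detail.get("crawlerName")}
```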

0
votes

I made a few changes to read with the header, as follows:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://bucketname/key_to_csv_file']},
    format= 'csv',
    format_options= {'withHeader': True})
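If none of the read-time options are available and the headers still land in row 0, a last-resort fallback is to promote that first row to column names after collecting the rows locally. This is a pure-Python sketch with a hypothetical helper name, not a Glue API:

```python
def promote_first_row_to_header(rows):
    """rows: list of lists where rows[0] holds the real column names.
    Returns the remaining rows as dicts keyed by those names."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Beware that this only works if the first row genuinely survives as the first element after collection, which Spark's partitioning does not guarantee in general.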