8 votes

At my wit's end here...

I have 15 csv files that I am generating from a beeline query like:

beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in csv classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 csv files to an s3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one, for which it names the columns col_0, col_1, etc., and includes the header line in my select queries.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this csv file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.
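In case it's useful, here's roughly how I've been checking whether the file still carries a UTF-16 BOM or leftover null bytes (just a quick Python sketch run against a local copy; data.csv is a placeholder name):

with open('data.csv', 'rb') as f:
    head = f.read(1024)

# A UTF-16 file typically starts with a byte-order mark.
print('UTF-16 BOM present' if head[:2] in (b'\xff\xfe', b'\xfe\xff') else 'no UTF-16 BOM')
# Leftover null bytes are another sign of a UTF-16 export being read as UTF-8.
print('null bytes present' if b'\x00' in head else 'no null bytes in first 1 KB')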

Again, any leads, suggestions, or experiments I can run would be great. By the way, I would prefer it if the crawler could do everything (i.e., not needing to manually change the schema and turn off updates).

Thanks for reading.

Edit:

I have a feeling this has something to do with it (source):

Every column in a potential header parses as a STRING data type.

Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.

Every column in a potential header must meet the AWS Glue regex requirements for a column name.

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

7
Hey, did you check if your header row has an empty header at the end of the line? (Brian Ocampo)

7 Answers

9 votes

Adding a Custom Classifier fixed a similar issue of mine.

You can avoid header detection (which doesn't work when all columns are string type) by setting ContainsHeader to PRESENT when creating the custom classifier and providing the column names through Header. Once the custom classifier has been created, you can assign it to the crawler. Since this is added to the crawler, you won't need to make changes to the schema after the fact, and you don't risk those changes being overwritten by the next crawler run. Using boto3, it would look something like:

import boto3

# Placeholder values; substitute your own crawler name, database,
# IAM role ARN, and S3 path.
GLUE_CRAWLER = 'contacts_crawler'
GLUE_DATABASE = 'contacts_db'
GLUE_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-glue-role'
s3_path = 's3://my-bucket/contacts/'

glue = boto3.client('glue')

# Declare the header up front so the crawler never has to guess it.
glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

# Attach the custom classifier to the crawler.
glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=GLUE_ROLE_ARN,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': s3_path}]},
                    Classifiers=['contacts_csv'])
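
After both calls succeed, you can also kick off a crawl programmatically (same hypothetical crawler name as above):

# Trigger a run of the crawler that now carries the custom classifier.
glue.start_crawler(Name=GLUE_CRAWLER)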
4 votes

I was having the same issue where Glue does not recognize the header row when all columns are strings.

I found that adding a new column at the end with an integer value solves the problem:

id,name,extra_column
sdf13,dog,1

0 votes

If the csv is produced by pandas and the problem is all columns being strings, you can add index_label='row_number' to the to_csv call to make pandas create the extra column for you (without index_label pandas still prints the index, but without a header for it, which still confuses the crawler).
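As a minimal sketch (the DataFrame contents and file name are made up):

import pandas as pd

# Every "real" column here is a string, which is exactly the case that trips up the crawler.
df = pd.DataFrame({'name': ['alice', 'bob'], 'city': ['Austin', 'Boston']})

# index_label writes the integer index as a named column, so the header row
# is no longer indistinguishable from all-string data rows.
df.to_csv('contacts.csv', index_label='row_number')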

0 votes

Glue's header detection is fragile. Make sure the column names are valid SQL names (i.e. no spaces) and that there are no empty column names (which often happens when exporting from Excel).
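A quick way to sanity-check the header row might look like this (a rough sketch, not Glue's exact naming rules; the file name and the '|' delimiter are assumptions):

import re

# Read the first line of the export and flag empty or suspicious column names.
with open('data.csv', encoding='utf-8') as f:
    header = f.readline().rstrip('\n').split('|')

for i, name in enumerate(header):
    if not name or not re.fullmatch(r'[A-Za-z0-9_]+', name):
        print(f'column {i} looks problematic: {name!r}')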

0 votes

This worked for me,

Classifier name: 'your classifier name'

Classifier type: 'csv'

Column delimiter: ',' (change it based on your preference)

Quote symbol: '"' (should be different from the column delimiter)

Column headings:

  • set this to 'No headings'
  • then provide your custom column names, e.g.: userid,username,userphone,useremail

Sample example of a custom classifier

-1 votes

Yes, you are correct about the header part: if the CSV file contains only string data, the header row is also treated as string data and not as a header. As a workaround, try putting the property 'skip.header.line.count'='1' in the table properties.
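If you'd rather set that property from code than in the console, a rough boto3 sketch (database and table names are placeholders) could look like:

import boto3

glue = boto3.client('glue')

# Fetch the existing table definition and add the property to its parameters.
table = glue.get_table(DatabaseName='my_database', Name='my_table')['Table']
table.setdefault('Parameters', {})['skip.header.line.count'] = '1'

# update_table only accepts TableInput fields, so copy over just those keys.
allowed = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
           'PartitionKeys', 'TableType', 'Parameters'}
glue.update_table(DatabaseName='my_database',
                  TableInput={k: v for k, v in table.items() if k in allowed})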

Regarding the "?" characters, you should use a hex editor to view those invalid characters and remove them from the file.

-2 votes

The last file you mentioned probably has the index column enabled. While saving, change it to:

df.to_csv('./filename.csv', index=False)