8 votes

At my wit's end here...

I have 15 csv files that I am generating from a beeline query like:

beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in csv classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 csv files to an s3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one, for which it names the columns col_0, col_1, etc., and includes the header line in my select queries.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this csv file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.
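In case it's useful, here's roughly how I've been checking whether the file still carries a UTF-16 BOM or leftover null bytes (just a quick Python sketch run against a local copy; data.csv is a placeholder name):

with open('data.csv', 'rb') as f:
    head = f.read(1024)

# A UTF-16 file typically starts with a byte-order mark.
print('UTF-16 BOM present' if head[:2] in (b'\xff\xfe', b'\xfe\xff') else 'no UTF-16 BOM')
# Leftover null bytes are another sign of a UTF-16 export being read as UTF-8.
print('null bytes present' if b'\x00' in head else 'no null bytes in first 1 KB')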

Again, any leads, suggestions, or experiments I can run would be great. By the way, I would prefer it if the crawler could do everything (i.e., not needing to manually change the schema and turn off updates).

Thanks for reading.

Edit:

I have a feeling this has something to do with it (source):

Every column in a potential header parses as a STRING data type.

Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.

Every column in a potential header must meet the AWS Glue regex requirements for a column name.

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

7
Hey, did you check if your header row has an empty header at the end of the line? (Brian Ocampo)

7 Answers

9 votes

Adding a Custom Classifier fixed a similar issue of mine.

You can avoid header detection (which doesn't work when all columns are string type) by setting ContainsHeader to PRESENT when creating the custom classifier and providing the column names through Header. Once the custom classifier has been created, you can assign it to the crawler. Since this is added to the crawler, you won't need to make changes to the schema after the fact, and you don't risk those changes being overwritten by the next crawler run. Using boto3, it would look something like:

import boto3

# Placeholder values; substitute your own crawler name, database,
# IAM role ARN, and S3 path.
GLUE_CRAWLER = 'contacts_crawler'
GLUE_DATABASE = 'contacts_db'
GLUE_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-glue-role'
s3_path = 's3://my-bucket/contacts/'

glue = boto3.client('glue')

# Declare the header up front so the crawler never has to guess it.
glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

# Attach the custom classifier to the crawler.
glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=GLUE_ROLE_ARN,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': s3_path}]},
                    Classifiers=['contacts_csv'])
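
After both calls succeed, you can also kick off a crawl programmatically (same hypothetical crawler name as above):

# Trigger a run of the crawler that now carries the custom classifier.
glue.start_crawler(Name=GLUE_CRAWLER)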
4 votes

I was having the same issue where Glue does not recognize the header row when all columns are strings.

I found that adding a new column at the end with an integer value solves the problem:

id,name,extra_column
sdf13,dog,1

0 votes

If the csv is produced by pandas and the problem is all columns being strings, you can add index_label='row_number' to the to_csv call to make pandas create the extra column for you (without index_label pandas still prints the index, but without a header for it, which still confuses the crawler).
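As a minimal sketch (the DataFrame contents and file name are made up):

import pandas as pd

# Every "real" column here is a string, which is exactly the case that trips up the crawler.
df = pd.DataFrame({'name': ['alice', 'bob'], 'city': ['Austin', 'Boston']})

# index_label writes the integer index as a named column, so the header row
# is no longer indistinguishable from all-string data rows.
df.to_csv('contacts.csv', index_label='row_number')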

0 votes

Glue's header detection is fragile. Make sure the column names are valid SQL names (i.e. no spaces) and that there are no empty column names (which often happens when exporting from Excel).
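A quick way to sanity-check the header row might look like this (a rough sketch, not Glue's exact naming rules; the file name and the '|' delimiter are assumptions):

import re

# Read the first line of the export and flag empty or suspicious column names.
with open('data.csv', encoding='utf-8') as f:
    header = f.readline().rstrip('\n').split('|')

for i, name in enumerate(header):
    if not name or not re.fullmatch(r'[A-Za-z0-9_]+', name):
        print(f'column {i} looks problematic: {name!r}')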

0 votes

This worked for me,

Classifier name: 'your classifier name'

Classifier type: 'csv'

Column delimiter: ',' (change it based on your preference)

Quote symbol: '"' (should be different from the column delimiter)

Column headings:

  • set this to 'No headings'
  • then provide your custom column names, e.g.: userid,username,userphone,useremail

Sample example of a custom classifier

-1 votes

Yes, you are correct about the header part: if the CSV file contains only string data, the header row is also treated as string data and not as a header. As a workaround, try putting the property 'skip.header.line.count'='1' in the table properties.
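If you'd rather set that property from code than in the console, a rough boto3 sketch (database and table names are placeholders) could look like:

import boto3

glue = boto3.client('glue')

# Fetch the existing table definition and add the property to its parameters.
table = glue.get_table(DatabaseName='my_database', Name='my_table')['Table']
table.setdefault('Parameters', {})['skip.header.line.count'] = '1'

# update_table only accepts TableInput fields, so copy over just those keys.
allowed = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
           'PartitionKeys', 'TableType', 'Parameters'}
glue.update_table(DatabaseName='my_database',
                  TableInput={k: v for k, v in table.items() if k in allowed})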

Regarding the "?" characters, you should use a hex editor to view those invalid characters and remove them from the file.

-2 votes

The last file you mentioned probably has the index column enabled. While saving, change it to:

df.to_csv('./filename.csv', index=False)