At my wit's end here...
I have 15 csv files that I am generating from a beeline query like:
beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv
I chose dsv because some string fields include commas and they are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does).
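If I'm reading the beeline docs right, the dsv delimiter can also be set explicitly via --delimiterForDSV (it defaults to the pipe character). The same export, sketched:

```
# Same export, with the dsv delimiter spelled out explicitly
# (CONN_STR and the SELECT are placeholders, as above).
beeline -u CONN_STR --outputformat=dsv --delimiterForDSV='|' \
  -e "SELECT ... " > data.csv
```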
Anyway, I upload these 15 csv files to an S3 bucket and run my crawler.
Everything works great. For 14 of them.
Glue is able to extract the header line for every single file except one. For that file, it names the columns col_0, col_1, etc., and includes the header line in the results of my SELECT queries.
Can anyone provide any insight into what could possibly be different about this one file that is causing this?
If it helps, I have a feeling that some of the fields in this csv file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.
I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.
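In case it suggests an experiment, here's roughly what I can check, assuming a typical Linux shell with file, hexdump, and iconv available:

```
# What encoding does file(1) think it is? (-i prints MIME type/charset)
file -i data.csv

# Look at the first bytes for a byte-order mark:
#   FF FE = UTF-16LE, FE FF = UTF-16BE, EF BB BF = UTF-8 BOM
hexdump -C data.csv | head -n 2

# Count any NUL bytes the earlier tr pass may have missed
tr -cd '\000' < data.csv | wc -c

# If the file really is UTF-16, re-encode it rather than stripping NULs
iconv -f UTF-16 -t UTF-8 data.csv > data_utf8.csv
```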
Again, any leads, suggestions, or experiments I can run would be great. Btw, I would prefer it if the crawler was able to do everything (i.e., not needing to manually change the schema and turn off updates).
Thanks for reading.
Edit:
I have a feeling this has something to do with it (source):
- Every column in a potential header parses as a STRING data type.
- Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
- Every column in a potential header must meet the AWS Glue regex requirements for a column name.
- The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
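If that last rule is the culprit (every column in the problem file parses as STRING, so the header row isn't "sufficiently different"), one workaround that still keeps the crawler in charge would be a custom CSV classifier that declares the header present. A sketch with the AWS CLI; the classifier name "pipe-dsv" and crawler name "my-crawler" are placeholders:

```
# Create a custom CSV classifier for pipe-delimited files that
# tells Glue the header row is definitely present.
aws glue create-classifier --csv-classifier \
  '{"Name": "pipe-dsv", "Delimiter": "|", "ContainsHeader": "PRESENT"}'

# Attach it to the crawler so it's tried before the built-in classifiers.
aws glue update-crawler --name my-crawler --classifiers pipe-dsv
```

That would leave schema updates on; the classifier only removes the header-vs-data guessing.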