5 votes

Every time I run a Glue crawler on existing data, it changes the SerDe serialization library to LazySimpleSerDe, which doesn't classify correctly (e.g. for quoted fields with commas in them).


I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.
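
For reference, the equivalent programmatic edit looks roughly like the boto3 sketch below; the database and table names are placeholders, and this is only a rough outline of the manual fix, not a way to stop the crawler resetting it:

    import boto3

    glue = boto3.client("glue")

    # Placeholder names -- substitute your own database and table.
    DATABASE = "my_database"
    TABLE = "my_table"

    # Fetch the table definition the crawler wrote.
    table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

    # Point the SerDe at OpenCSVSerde instead of LazySimpleSerDe.
    table["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
        "Parameters": {"separatorChar": ",", "quoteChar": '"'},
    }

    # update_table only accepts the writable TableInput fields, so drop the
    # read-only ones that get_table returns before sending it back.
    read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                 "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
    table_input = {k: v for k, v in table.items() if k not in read_only}

    glue.update_table(DatabaseName=DATABASE, TableInput=table_input)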

I've tried making my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Can you confirm whether your data has double quotes in it? If yes, the crawler should populate the table with OpenCSVSerde when a custom CSV classifier is used. Can you share the custom CSV classifier config that you used? In my case the crawler populated the table with OpenCSVSerde. – Prabhakar Reddy
@Prabhakar The data does have double quotes for some rows, but not all; the ones with newlines in them are double quoted. Do you think that editing the data so that all lines are double quoted is a solution? The custom classifier was just created with the Glue web interface; there don't seem to be many customization options there. – Luigi Plinge
@RhysJonesa.k.a.Luigi Did you find a solution for this other than updating the table definition manually or programmatically? – chandu
@chandu Not a solution, but we stopped the Glue crawlers from auto-running since schema changes are rare. If you find a solution, let me know! – Luigi Plinge

1 Answer

4 votes

You can't specify the SerDe in the Glue crawler at this time, but here is a workaround:

  1. Create a Glue Crawler with the following configuration.

    Enable 'Add new columns only' - this adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

    Enable 'Update all new and existing partitions with metadata from the table' - with this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table, and any changes to these properties in a table are propagated to its partitions. (A boto3 sketch of this crawler configuration follows the list.)

  2. Run the crawler to create the table. It will be created with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  3. Re-run the crawler.

  4. If a new partition is added on a crawler re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  5. You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and is not reset by subsequent crawler runs.
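
If you create the crawler programmatically rather than in the console, the two options from step 1 map onto the crawler's Configuration JSON. Here is a minimal boto3 sketch; the crawler name, role, database, and S3 path are placeholders:

    import json
    import boto3

    glue = boto3.client("glue")

    # Placeholder names -- substitute your own role, database, path, etc.
    glue.create_crawler(
        Name="my-csv-crawler",
        Role="MyGlueServiceRole",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},
        # 'Add new columns only' -> Columns: MergeNewColumns
        # 'Update all new and existing partitions with metadata from the
        #  table' -> Partitions: InheritFromTable
        Configuration=json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Columns": {"AddOrUpdateBehavior": "MergeNewColumns"},
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            },
        }),
    )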