11
votes

I am using Spark to write files to S3 in ORC format, and Athena to query this data.

I am using the following partition keys:

s3://bucket/company=1123/date=20190207

When I run the Glue crawler on the bucket, everything works as expected except for the types of the partition keys.

The crawler registers them in the catalog as type string instead of int.

Is there a configuration option to define the default type of the partition keys?

I know the types can be changed manually afterwards, with the crawler config then set to Add new columns only.

1
Have you found a solution so far? I'm stuck with the same problem: all my partition keys are of type int, but the crawler discovers them as strings... - ChernikovP
After the first run of the crawler I set the keys to integer type and change the crawler config to Add new columns only. - Alex Stanovsky
It makes sense, but I was hoping there was some configurable / programmatic option to handle partition type discovery from the beginning. This manual adjustment doesn't fit well into provisioning automation. Also, the Add new columns only option doesn't work well when the schema is subject to change once in a while, as it's so easy to forget this particular crawler config value... - ChernikovP

1 Answer

8
votes

Glue crawlers always treat partition keys as type string, and unfortunately there is no configuration option to change this behavior.
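
Since the crawler itself cannot be configured for this, the manual fix described in the question (retype the keys after the first crawl) can at least be scripted. Below is a minimal sketch using boto3's Glue client, not an official recipe. The function names `retype_partition_keys` and `fix_partition_types`, and the database and table names in the usage comment, are hypothetical. The allow-list of `TableInput` fields is my assumption about the commonly accepted fields and may need adjusting for your boto3 version.

```python
import copy

# get_table returns read-only fields (CreateTime, DatabaseName, ...) that
# update_table rejects, so only fields assumed valid in TableInput are kept.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
}


def retype_partition_keys(table, new_types):
    """Build a TableInput dict from a get_table result with retyped partition keys.

    table: the "Table" dict returned by glue.get_table
    new_types: mapping of partition key name to the desired Glue type
    """
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    for partition_key in table_input.get("PartitionKeys", []):
        if partition_key["Name"] in new_types:
            partition_key["Type"] = new_types[partition_key["Name"]]
    return table_input


def fix_partition_types(database, table_name, new_types):
    """Fetch the crawled table and write it back with the new partition key types."""
    import boto3  # imported here so the helper above works without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(
        DatabaseName=database,
        TableInput=retype_partition_keys(table, new_types),
    )


# Hypothetical usage, run once after the first crawl:
# fix_partition_types("my_database", "my_table", {"company": "int", "date": "int"})
```

The crawler still has to be set to Add new columns only (or otherwise prevented from updating the table definition), or a later crawl may reset the types to string.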