4
votes

I have a number of items in an s3 path that I'm trying to crawl (using a root path of s3://my-bucket/somedata/)

s3://my-bucket/somedata/20180101/data1/stuff.txt.gz
s3://my-bucket/somedata/20180101/data2/stuff.txt.gz
s3://my-bucket/somedata/20180101/data1.sql
s3://my-bucket/somedata/20180101/data2.sql  
s3://my-bucket/somedata/20180102/data1/stuff.txt.gz
s3://my-bucket/somedata/20180102/data2/stuff.txt.gz
...

Sometimes tables are named according to the date pattern (e.g. 20180101); sometimes they are named according to the leaf-level 'folder' (e.g. data1); sometimes the file (e.g. data1.sql); and when there are conflicts it seems that Glue just appends a unique identifier to the table name (e.g. data1_c17b2f988649f2171b24b1d35da7f2b4).

What is the logic here? Are these names deterministic? Are there patterns I should use for structuring my data so that the crawler will catalog things in some logical order?

1
Any solution for this? I have all the parquet files with the same schema in a folder, let's say A. My crawler is creating a table with a weird suffix like A_d6gw2y3h83737hfj - Tula
@Tula my solution and guidance is "don't use Glue crawler" - at least in this way. I find it is much more trouble than it is worth. I do sometimes use it for schema discovery, but I will typically retrieve the schema and use it to create / name tables in a way that makes more sense. - Kirk Broadhurst
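The "discover, then rename" workflow described in the comment above can be sketched with the Glue API: retrieve the crawler-generated table descriptor, keep its discovered schema and location, and re-create it under a sensible name. This is a minimal sketch, not the commenter's actual code; the database name, table names, and the `raw_` descriptor contents below are placeholders.

```python
# Sketch of the "discover then rename" workflow: keep the schema the
# crawler found, but register the table under a name you choose.
# All database/table names here are placeholders.

def rename_table_input(discovered: dict, new_name: str) -> dict:
    """Turn a crawler-discovered table descriptor (the "Table" dict
    returned by glue.get_table()) into a TableInput for create_table,
    preserving the discovered schema, location, and partition keys."""
    return {
        "Name": new_name,
        "StorageDescriptor": discovered["StorageDescriptor"],
        "PartitionKeys": discovered.get("PartitionKeys", []),
        "TableType": discovered.get("TableType", "EXTERNAL_TABLE"),
    }

# Stand-in for a descriptor the crawler produced with a generated suffix:
discovered = {
    "Name": "data1_c17b2f988649f2171b24b1d35da7f2b4",
    "StorageDescriptor": {
        "Columns": [{"Name": "id", "Type": "bigint"}],
        "Location": "s3://my-bucket/somedata/20180101/data1/",
    },
}
clean = rename_table_input(discovered, "data1")
print(clean["Name"])  # data1

# With boto3 (requires AWS credentials and an existing Glue database):
#   import boto3
#   glue = boto3.client("glue")
#   t = glue.get_table(DatabaseName="db", Name=discovered["Name"])["Table"]
#   glue.create_table(DatabaseName="db", TableInput=rename_table_input(t, "data1"))
```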

1 Answer

-2
votes

You need to standardize the path to get the name correctly e.g.

s3://my-bucket/Customer/Customer_20180101/customer.csv 
s3://my-bucket/Customer/Customer_20180102/customer.csv 
s3://my-bucket/Customer/Customer_20180103/customer.csv 
s3://my-bucket/Customer/Customer_20180104/customer.csv 
s3://my-bucket/Customer/Customer_20180105/customer.csv

This will load all the files into a single Customer table once you point the crawler at the Customer folder on S3.
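To make the resulting name fully predictable, you can also set a table prefix when defining the crawler. A minimal sketch with boto3 follows; the crawler name, role ARN, and database name are placeholders, and the actual API calls (shown in comments) require AWS credentials.

```python
# Sketch: point a Glue crawler at one standardized folder so the
# table name is derived from that folder, not the dated subfolders.
# Role ARN, database, and crawler names are placeholders.

def build_crawler_config(bucket: str, folder: str) -> dict:
    """Build a create_crawler request targeting one logical table folder."""
    return {
        "Name": f"{folder.lower()}-crawler",
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        "DatabaseName": "my_database",                             # placeholder
        # One S3 target per logical table: the crawler names the table
        # after "Customer" and treats the dated subfolders as partitions.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{folder}/"}]},
        # An explicit prefix keeps names deterministic: prefix + folder name.
        "TablePrefix": "raw_",
    }

config = build_crawler_config("my-bucket", "Customer")
print(config["Targets"]["S3Targets"][0]["Path"])  # s3://my-bucket/Customer/

# With boto3 (requires credentials):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**config)
#   glue.start_crawler(Name=config["Name"])
```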