I'm using an AWS Glue Crawler to crawl roughly 170 GB of Avro data to create a Data Catalog table.
There are a couple of different schema versions in the Avro data, but the crawler still manages to combine the data into a single table (I have enabled the "Group by data compatibility and schema similarity" mode).
Here is where things get problematic.
I can only use Athena to run a SELECT COUNT(*) FROM <DB>.<TABLE> query on the data - any other query raises the following error:
GENERIC_INTERNAL_ERROR: Unknown object inspector category: UNION
A brief Google search leads me to believe that this has something to do with the schema in the Avro files.
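To narrow down which fields might be involved, here is a minimal sketch that walks an Avro schema (already loaded as JSON) and lists the union types it contains. In Avro, a union is written as a JSON array of branch types. The helper names `find_unions` and `_type_name`, and the example schema with its field names, are hypothetical and not taken from my data. Skipping the plain nullable pattern `["null", T]` is an assumption about what triggers the Athena error, not something I have confirmed.

```python
import json


def _type_name(branch):
    """Short label for one union branch (hypothetical helper)."""
    if isinstance(branch, str):
        return branch
    if isinstance(branch, dict):
        return branch.get("name") or str(branch.get("type"))
    return "union"


def find_unions(schema, path=""):
    """Return (field_path, branch_labels) for every union with more than
    one non-null branch. Assumption: ["null", T] unions are not the problem."""
    found = []
    if isinstance(schema, list):
        # A JSON array in an Avro schema denotes a union of its branches.
        non_null = [b for b in schema if b != "null"]
        if len(non_null) > 1:
            found.append((path or "<root>", [_type_name(b) for b in schema]))
        for branch in schema:
            found.extend(find_unions(branch, path))
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema.get("fields", []):
                child = f"{path}.{field['name']}" if path else field["name"]
                found.extend(find_unions(field["type"], child))
        elif kind == "array":
            found.extend(find_unions(schema["items"], path + "[]"))
        elif kind == "map":
            found.extend(find_unions(schema["values"], path + "{}"))
        elif isinstance(kind, (list, dict)):
            # Nested type definition, e.g. {"type": {"type": "array", ...}}.
            found.extend(find_unions(kind, path))
    return found


# Made-up schema for illustration only.
EXAMPLE_SCHEMA = json.loads("""
{
  "type": "record", "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "note", "type": ["null", "string"]},
    {"name": "value", "type": ["null", "long", "string"]},
    {"name": "tags", "type": {"type": "array", "items": ["int", "string"]}}
  ]
}
""")

unions = find_unions(EXAMPLE_SCHEMA)
for field_path, branches in unions:
    print(field_path, branches)
```

Running this against the writer schema of each schema version should show whether the versions differ in their union fields.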
Normally, this is where I would focus my effort, BUT: I have been able to do this exact same procedure (AVRO -> crawler -> Glue job -> PARQUET) before, with a smaller Avro data set (50 GB) that had the same issue (only being able to run a count query). Moving on.
The conversion job previously took about an hour. Now, when running the same job on the 170 GB data, the job finishes in a minute because glueContext.create_dynamic_frame.from_catalog now returns an empty frame - no errors, no nothing. The confusion is real, as I am able to run a COUNT query in Athena on the same table that the job is using, and it returns a count of 520M objects.
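To make the empty read visible instead of letting the job finish silently, the read step could be wrapped roughly like this. It is only a sketch: `read_catalog_table` is a hypothetical wrapper name, and `glue_context` is assumed to be the GlueContext the job already creates. It does not explain why the frame is empty, it only surfaces it.

```python
def read_catalog_table(glue_context, database, table_name):
    """Read a Data Catalog table and raise if the resulting frame is empty.

    glue_context is assumed to be an awsglue GlueContext created by the job.
    """
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )
    # count() forces the read, so an empty result shows up here
    # rather than as an empty output folder later on.
    record_count = frame.count()
    if record_count == 0:
        raise RuntimeError(
            f"{database}.{table_name} returned 0 records from the catalog read"
        )
    return frame
```

Note that `count()` scans the input, so on 170 GB it adds noticeable runtime; it is meant as a debugging aid rather than something to leave in the job.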
Does anyone have an idea what the problem might be?
A couple of things that might be relevant:
- The COUNT query returns 520M, but the recordCount in the table properties says 170M records.
- The data is stored in 300k .avro files with sizes of 2 MB-30 MB.
- Yes, the crawler is pointed to the folder with all the files, not to a file (common crawler gotcha).
- The previous attempt with a smaller data set (50 GB) was 100% successful - I could crawl the Parquet data and query it with Athena (tested many different queries, all working).