0
votes

I'm using an AWS Glue Crawler to crawl roughly 170 GB of avro data to create a Data Catalog table.

There are a couple of different schema versions in the avro data, but the crawler still manages to combine the data into a single table (I have enabled the "Group by data compatibility and schema similarity" mode).

Here is when things get problematic.

I can only use Athena to run a SELECT COUNT(*) FROM <DB>.<TABLE> query on the data - any other query raises the following error:

GENERIC_INTERNAL_ERROR: Unknown object inspector category: UNION

A brief Google check leads me to believe that this has something to do with the schema in the avro files.

Normally, this is where I would focus my effort, BUT: I have been able to do this exact same procedure (AVRO -> crawler -> Glue job -> PARQUET) before with a smaller avro data set (50 GB) that had the same issue (only being able to run a count query). Moving on.

The conversion job previously took about an hour. Now, when running the same job on the 170 GB data, the job finishes in a minute because glueContext.create_dynamic_frame.from_catalog returns an empty frame - no errors, no nothing. The confusion is real, as I am able to run a COUNT query in Athena on the same table that the job is using, and it returns a count of 520M objects.

Does anyone have an idea what the problem might be?

A couple of things that might be relevant:

  • The COUNT query returns 520M but the recordCount in table properties says 170M records.
  • The data is stored in 300k .avro files with size 2MB-30MB
  • Yes, the crawler is pointed to the folder with all the files, not to a file (common crawler gotcha).
  • The previous attempt with a smaller data set (50 GB) was 100% successful - I could crawl the parquet data and query it with Athena (tested many different queries, all working)

1 Answer

1
votes

We had the same issue and could solve it as follows.

In our avro schema there was a record with mixed field types, i.e., some were of the form "type" : [ "string" ], others of the form "type" : [ "null", "string" ].

Changing this manually to [ "null", "string" ] everywhere, we were able to use the table in Athena without any issues.
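
If the schema is large, the same change can be scripted instead of edited by hand. The following is only a minimal sketch, assuming the schema is available as a parsed JSON document; `normalize_unions` and the example schema are hypothetical names made up for illustration, not part of Glue or any avro library. It rewrites every single-branch union such as [ "string" ] into [ "null", "string" ] and leaves everything else untouched.

```python
def normalize_unions(schema):
    """Return a copy of a parsed Avro schema in which every single-branch
    union such as ["string"] becomes ["null", "string"]."""
    if isinstance(schema, list):
        # In an Avro schema, a JSON list in a type position is a union.
        branches = [normalize_unions(branch) for branch in schema]
        if len(branches) == 1 and branches[0] != "null":
            return ["null"] + branches
        return branches
    if isinstance(schema, dict):
        result = {}
        for key, value in schema.items():
            if key in ("type", "items", "values"):
                # Type positions: field type, array items, map values.
                result[key] = normalize_unions(value)
            elif key == "fields":
                # A record's field list is not a union; visit each field.
                result[key] = [normalize_unions(field) for field in value]
            else:
                # Names, docs, defaults, enum symbols, etc. are kept as-is.
                result[key] = value
        return result
    # Primitive type names like "string" or "long".
    return schema


# Hypothetical example schema mixing both union forms.
example_schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "a", "type": ["string"]},
        {"name": "b", "type": ["null", "string"]},
        {"name": "c", "type": "long"},
    ],
}

fixed_schema = normalize_unions(example_schema)
```

One caveat: avro expects a field's default value to match the first branch of its union, so a field with a non-null default may need its default adjusted (or the branch order reconsidered) after this change.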