1
votes

I've crawled a couple of XML files on S3 using AWS Glue, using a simple XML classifier:

enter image description here

However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):

HIVE_UNKNOWN_ERROR: Unable to create input format

enter image description here

Note that Athena can see my tables and it can see the columns, it just can't query them:

enter image description here

 

  • I know there is a similar question here about this error, but the query in that question targeted an RDS database, whereas mine targets an S3 bucket.

Has anyone got a solution for this?

1
Can you please share the XML file so that I can replicate the behavior on my end? - Tanveer Uddin
Yes. Here is a sample: pastebin.com/8yLmteZX - Felipe
Hi Felipe, it looks like Athena does not support XML files, even though the Glue crawler nicely created an external table for you. aws.amazon.com/athena/faqs/… does not list XML as a supported format. You may have to convert the XML into JSON or Parquet so that you can query it from Athena. - Tanveer Uddin

1 Answer

2
votes

Sadly, at this time (12/2018), Athena cannot query XML input, which is hard to understand when you may hear that Athena together with AWS Glue can query XML.

The output you are seeing from the AWS crawler is correct, though; it is just not doing what you think it is. After your crawler has run, you can see the tables but cannot execute any Athena queries against them. Go into your AWS Glue Catalog and, at the right, click Tables, click your table, and choose to edit its properties. It will look something like this: enter image description here

Notice how the input format is null? If you have any other tables, you can look at their properties for comparison, or refer back to the Athena documentation on input formats. The missing input format is what causes the error you receive.
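If you would rather check this from code than from the console, here is a minimal sketch. It assumes the table metadata has the shape of the "Table" part of a boto3 Glue `get_table` response, where the input format lives under `StorageDescriptor.InputFormat`. The helper names and the sample tables are hypothetical, made up for illustration.

```python
def has_input_format(table: dict) -> bool:
    """Return True if the Glue table metadata names a non-empty input format."""
    descriptor = table.get("StorageDescriptor") or {}
    return bool(descriptor.get("InputFormat"))


def fetch_table(database: str, name: str) -> dict:
    """Fetch table metadata from the Glue Data Catalog (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so has_input_format works without AWS access

    return boto3.client("glue").get_table(DatabaseName=database, Name=name)["Table"]


# Hypothetical metadata, shaped like what a crawler might leave behind.
xml_table = {"Name": "my_xml_table", "StorageDescriptor": {"Columns": []}}
csv_table = {
    "Name": "my_csv_table",
    "StorageDescriptor": {"InputFormat": "org.apache.hadoop.mapred.TextInputFormat"},
}

print(has_input_format(xml_table))
print(has_input_format(csv_table))
```

With real credentials you would call something like `has_input_format(fetch_table("my_database", "my_xml_table"))`, where both names are placeholders for your own.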

Solutions:

  1. Convert your data to text/JSON/Avro/another supported format prior to upload.
  2. Create an AWS Glue job that converts the XML source into a target format Athena supports (hopefully a compressed one, such as ORC/Parquet).
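
As a rough sketch of the first option, the snippet below converts XML into JSON Lines (one JSON object per line, which is the layout Athena's JSON SerDes expect) using only the Python standard library. The sample document, the `book` record tag, and the function name are assumptions for illustration; your real files will have a different layout, and nested elements would need extra handling.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_json_lines(xml_text: str, record_tag: str) -> str:
    """Flatten each <record_tag> element into one JSON object per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter(record_tag):
        # Attributes and direct child elements both become top-level keys.
        row = dict(record.attrib)
        for child in record:
            row[child.tag] = (child.text or "").strip()
        lines.append(json.dumps(row))
    return "\n".join(lines)


# Hypothetical sample; the real file layout will differ.
sample = """<catalog>
  <book id="1"><title>First</title><price>10.5</price></book>
  <book id="2"><title>Second</title><price>7</price></book>
</catalog>"""

json_lines = xml_to_json_lines(sample, "book")
print(json_lines)
```

All values come out as strings here; cast them in the table definition or in your queries as needed. Upload the resulting file to S3 and crawl that location instead of the XML one.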