What benefit does Firehose+S3 provide instead of using Athena directly on DynamoDB?

Question

In this article - https://aws.amazon.com/blogs/database/how-to-perform-advanced-analytics-and-build-visualizations-of-your-amazon-dynamodb-data-by-using-amazon-athena/:

Similarly this article - https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/:

Why not use Athena to query DynamoDB directly?

Marcos Tomaz · Accepted Answer · 2021-03-10T15:15:56

First of all, Athena cannot natively query DynamoDB. To query that data, you need to make it available in a location that AWS Glue recognizes as a valid data source. The most common choices are S3 and Kinesis (for performance and cost reasons), but there are other options, such as:

JDBC
Amazon RDS
MongoDB
Amazon DocumentDB
Kafka
(other options are displayed depending on the method you choose to map the data)

For DynamoDB, you must extract the data from the desired table before it can be queried, or, as in the first example, stream the changes out in real time.

Here is how each scenario works.

First Scenario: Uses DynamoDB Streams feeding Kinesis Data Firehose, which delivers the change records emitted by the stream to S3 in near real time. Athena can then use S3 as the source for the data.
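To make the first scenario more concrete, here is a minimal sketch of the piece that sits between the stream and Firehose. It assumes a Lambda-style function receives the DynamoDB Streams event and forwards each new item as one JSON line. The function names, the stream name, and the simplified type conversion (only S, N, BOOL, NULL, L and M) are my own assumptions, not code from the linked article. The Firehose client is passed in, so in real use it would be something like boto3.client("firehose").

```python
import json
from decimal import Decimal


def from_dynamodb(value):
    """Convert one DynamoDB-typed value, e.g. {"S": "abc"}, to a plain Python value.

    Simplified on purpose: only S, N, BOOL, NULL, L and M are handled.
    """
    type_tag, raw = next(iter(value.items()))
    if type_tag == "S":
        return raw
    if type_tag == "N":
        number = Decimal(raw)
        return int(number) if number == number.to_integral_value() else float(number)
    if type_tag == "BOOL":
        return raw
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(item) for item in raw]
    if type_tag == "M":
        return {key: from_dynamodb(item) for key, item in raw.items()}
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def to_firehose_records(event):
    """Turn a DynamoDB Streams event into Firehose records (one JSON line per item)."""
    records = []
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage")
        if image is None:  # REMOVE events carry no NewImage, so they are skipped here
            continue
        item = {key: from_dynamodb(value) for key, value in image.items()}
        line = json.dumps(item, sort_keys=True) + "\n"  # newline-delimited JSON
        records.append({"Data": line.encode("utf-8")})
    return records


def forward(event, firehose_client, stream_name, batch_size=500):
    """Send the records to Firehose in batches; returns how many were sent.

    put_record_batch accepts at most 500 records per call. Partial failures
    (FailedPutCount in the response) are not retried in this sketch.
    """
    records = to_firehose_records(event)
    for start in range(0, len(records), batch_size):
        firehose_client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + batch_size],
        )
    return len(records)
```

Writing one JSON object per line is a deliberate choice: Athena's JSON SerDes expect each record on its own line in the S3 objects.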

Second Scenario: Uses a Glue crawler to map the data schema from DynamoDB and create a table in your Data Catalog that holds the schema of the item attributes. To extract the data itself, a Glue job reads from that catalog table and writes the data to S3, creating a second table in your Data Catalog, this time pointing to S3. That second table is what makes the data available for Athena to query.
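The second scenario could be orchestrated roughly like this. It is only a sketch under assumptions: the Glue client is passed in (for example boto3.client("glue")), the Glue job itself already exists, and the job argument names --source_table and --s3_target are hypothetical, since job arguments are whatever your own job script reads.

```python
import time


def crawler_request(crawler_name, role_arn, database, dynamodb_table):
    """Build the arguments for Glue's create_crawler call with a DynamoDB target."""
    return {
        "Name": crawler_name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"DynamoDBTargets": [{"Path": dynamodb_table}]},
    }


def wait_for_crawler(glue_client, crawler_name, poll_seconds=15, max_polls=80):
    """Poll until the crawler reports READY again, meaning the run has finished."""
    for _ in range(max_polls):
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {crawler_name} did not finish in time")


def run_export(glue_client, crawler_name, role_arn, database,
               dynamodb_table, job_name, s3_target, poll_seconds=15):
    """Crawl the DynamoDB table, then start the Glue job that writes it to S3."""
    glue_client.create_crawler(
        **crawler_request(crawler_name, role_arn, database, dynamodb_table)
    )
    glue_client.start_crawler(Name=crawler_name)
    wait_for_crawler(glue_client, crawler_name, poll_seconds=poll_seconds)
    run = glue_client.start_job_run(
        JobName=job_name,
        # Hypothetical argument names; they must match what the job script expects.
        Arguments={"--source_table": dynamodb_table, "--s3_target": s3_target},
    )
    return run["JobRunId"]
```

The wait matters because the crawler runs asynchronously: the job depends on the catalog table the crawler creates, so starting it too early would fail.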

DynamoDB's data structure and storage are not optimized for the relational queries that Athena runs; you can read more about this in the DynamoDB docs.
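For illustration, this is the kind of relational query that becomes straightforward once the data sits in S3 behind a Data Catalog table. It is a sketch only: the table orders_export, its columns, the database name and the output location are made-up names, and the Athena client is passed in (for example boto3.client("athena")).

```python
# Hypothetical table and column names, used only for illustration.
TOP_CUSTOMERS_SQL = """
SELECT customer_id, COUNT(*) AS order_count
FROM orders_export
GROUP BY customer_id
ORDER BY order_count DESC
LIMIT 10
"""


def run_athena_query(athena_client, sql, database, output_location):
    """Submit a query to Athena and return its execution id.

    Athena runs queries asynchronously, so results must be fetched later
    using the returned id.
    """
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

An aggregate like this touches every item, which in DynamoDB would mean scanning the whole table, while Athena is built to scan files in S3.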
