I have DynamoDB tables that are backed up into S3, and need a (fairly minimal) way to do bespoke queries on our data in a way that Dynamo doesn't support. Athena seems like a nice way to use our S3 backups as the source for our proto DW/BI tool.

The issue is that the DynamoDB backup to S3 stores each row in this format:

{"Column1":{"n":"1234"},"Column2":{"n":"5678"},"Column3":{"s":"abcd"}}

This makes it difficult for Athena / Presto to read. I can import the data using:

CREATE EXTERNAL TABLE test_table (
    column1 struct<n:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-backup-bucket/path/'  -- placeholder; point this at the backup location

But this is not ideal, as the 'number' is still treated as a string. I have considered downloading the file, flattening the JSON, and re-uploading it somewhere else, but the file as a whole is not valid JSON either; only each individual row is.
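
As a stopgap, the wrapped string can at least be cast at query time; a minimal sketch against the test_table above (the alias is just illustrative):

SELECT CAST(column1.n AS integer) AS column1
FROM test_table

That still leaves the struct wrapping baked into the table definition, though.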

Is there a way for Athena to read the data in the file so it imports the field types correctly? Or ideally is there a way that Athena can flatten the {"n":"1234"} into an int?

1 Answer

I'm not convinced this is the ideal answer, but it's the one I've found.

Presto lets you both extract JSON and cast values. So, instead of altering how the data is stored before Athena reads it, you can declare the wrapped fields with the correct types when you create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS tableToCreate (
    userid struct<s:string>,
    `timestamp` struct<n:bigint>,  -- timestamp is a reserved word, so it needs backquotes in DDL
    reason struct<s:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-place'

Then, when querying, reference column.s or column.n:

SELECT userid.s, "timestamp".n FROM tableToCreate WHERE reason.s = 'REASON'
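
If you'd rather not sprinkle .s and .n through every query, Athena also supports views, so the struct access can be hidden behind one; a minimal sketch (the view name and column aliases are just illustrative):

CREATE VIEW tableToCreate_flat AS
SELECT
    userid.s AS userid,
    "timestamp".n AS event_time,
    reason.s AS reason
FROM tableToCreate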

You can read more about it here: https://prestodb.io/docs/current/functions/json.html
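
For instance, json_extract_scalar plus CAST (both standard Presto functions documented there) can unwrap and convert a value in one step; a minimal sketch against a JSON literal:

SELECT CAST(json_extract_scalar('{"n":"1234"}', '$.n') AS integer)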