3 votes

I have a table in Amazon DynamoDB with a record structure like

{"username" : "joe bloggs" , "products" : ["1","2"] , "expires1" : "01/01/2013" , "expires2" : "01/02/2013"} 

where the products attribute is a list of products belonging to the user, and the expiresN attributes correspond, in order, to the products in that list. The product list is dynamic and there can be many entries. I need to transfer this data to S3 in a format like

joe bloggs|1|01/01/2013
joe bloggs|2|01/02/2013

Using Hive external tables I can map the username and products columns in DynamoDB; however, I am unable to map the dynamic expiresN columns. Is there a way I could extend or adapt org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler in order to interpret and restructure the data retrieved from DynamoDB before Hive ingests it? Or is there an alternative solution to convert the DynamoDB data to first normal form?
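To pin down exactly the mapping I am after, here is a rough sketch in plain Python (illustrative only, not how I intend to run this in production; it assumes items have the shape of the example record above, with the Nth product's expiry stored under expiresN):

```python
# Sketch: flatten one DynamoDB-style item into first-normal-form,
# pipe-delimited rows. Illustrative only; attribute names match the
# example record in this question.

def flatten(item):
    """Yield one 'username|product|expiry' line per product.

    Assumes the Nth product's expiry date is stored under the
    1-based key 'expiresN', as in the example record.
    """
    for i, product in enumerate(item["products"], start=1):
        expiry = item.get("expires%d" % i, "")
        yield "|".join([item["username"], product, expiry])

item = {
    "username": "joe bloggs",
    "products": ["1", "2"],
    "expires1": "01/01/2013",
    "expires2": "01/02/2013",
}

for line in flatten(item):
    print(line)
```

Running this on the example record prints the two pipe-delimited lines shown above.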

One of my key requirements is that I maintain the throttling provided by the dynamodb.throughput.read.percent setting, so that I do not compromise operational use of the table.

1
You should also post this on the official DynamoDB forum (forums.aws.amazon.com/forum.jspa?forumID=131). Amazon employees respond to most of the posts there. - pw.
Hi @stjohnroe, do you have a solution to this already? I'm following this question. Please let me know if you found any solutions. - Cliff Richard Anfone

1 Answer

1 vote

You could build a custom UDTF (user-defined table-generating function) for this case. I'm not sure how Hive handles an asterisk (which you would probably need here) as an argument to the function.

Something like what Explode (source) does.
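A UDTF's contract is essentially one input row in, zero or more output rows out, which is exactly the flattening you need. A rough sketch of that contract in plain Python (names here are hypothetical; a real Hive UDTF is Java code extending org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and emitting each output row via forward()):

```python
# Plain-Python sketch of the UDTF contract: one wide input row in,
# several narrow output rows out. Class and method names are
# illustrative only; a real Hive UDTF is written in Java.

class ExplodeProductsUDTF:
    def __init__(self, forward):
        # 'forward' stands in for GenericUDTF.forward(): it is called
        # once per output row.
        self.forward = forward

    def process(self, row):
        # Emit one (username, product, expiry) row per product,
        # pairing the Nth product with the expiresN attribute.
        for i, product in enumerate(row["products"], start=1):
            self.forward(
                (row["username"], product, row.get("expires%d" % i))
            )

rows_out = []
udtf = ExplodeProductsUDTF(rows_out.append)
udtf.process({
    "username": "joe bloggs",
    "products": ["1", "2"],
    "expires1": "01/01/2013",
    "expires2": "01/02/2013",
})
# rows_out now holds one tuple per product
```

The Java version would do the same pairing inside process(), and you would call it from Hive with a LATERAL VIEW, the same way explode() is used.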