
I referred to this link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Hive_Commands.html.

My Hive script is as follows:

DROP TABLE IF EXISTS hiveTableName;
CREATE EXTERNAL TABLE hiveTableName (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");  

DROP TABLE IF EXISTS s3TableName;
CREATE EXTERNAL TABLE s3TableName (item map<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://bucket/test-hive2'; 

SET dynamodb.throughput.read.percent=0.8;   

INSERT OVERWRITE TABLE s3TableName SELECT * 
FROM hiveTableName; 

The DynamoDB table is successfully exported to S3, but the file format is not JSON; it looks like this:

uuid{"s":"db154955-8555-4b49-bf40-ee36605ac510"}num{"n":"1294"}info{"s":"qwefjdkslafjdafl"}
uuid{"s":"d9898564-2b56-42ba-9cfb-fd092e7d0b8d"}num{"n":"100"}info{"s":"qwefjdkslafjdafl"}

Does anyone know how to export in JSON format? I know I can use Data Pipeline, which can export a DynamoDB table to S3 in JSON format, but for some reason I need to use EMR. I also tried another tool, https://github.com/awslabs/emr-dynamodb-connector, with the command:

java -cp target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar org.apache.hadoop.dynamodb.tools.DynamoDBExport /where/output/should/go my-dynamo-table-name

but the error was

Error: Could not find or load main class org.apache.hadoop.dynamodb.tools.DynamoDBExport
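
A likely cause of that error is the class name: in the emr-dynamodb-connector repository the tools class is spelled org.apache.hadoop.dynamodb.tools.DynamoDbExport (lowercase "b" in "Db"), and it is usually launched with hadoop jar so the Hadoop dependencies are on the classpath. A sketch, assuming a recent checkout of the repo, reusing the same output path and table name as above:

# Assumption: class name and hadoop-jar launch style taken from a recent
# checkout of awslabs/emr-dynamodb-connector; verify against your build.
hadoop jar target/emr-dynamodb-tools-4.2.0-SNAPSHOT.jar \
    org.apache.hadoop.dynamodb.tools.DynamoDbExport \
    /where/output/should/go my-dynamo-table-name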

Can someone tell me how to solve these problems? Thanks.

== update ==

If I use to_json, as Chris suggested, my code is as follows:

DROP TABLE IF EXISTS hiveTableName2;
CREATE EXTERNAL TABLE hiveTableName2 (item map<string, string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "test_table", "dynamodb.region"="us-west-2");  

DROP TABLE IF EXISTS s3TableName2;
CREATE EXTERNAL TABLE s3TableName2 (item string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://backup-restore-dynamodb/hive-test'; 

INSERT OVERWRITE TABLE s3TableName2 SELECT to_json(item)
FROM hiveTableName2; 

When I look at the generated file, it looks like this:

{"uuid":"{\"s\":\"db154955-8555-4b49-bf40-ee36605ac510\"}","num":"{\"n\":\"1294\"}","info":"{\"s\":\"qwefjdkslafjdafl\"}"}

What I want is a nested map, like

map<string, map<string, string>>

not

map<string, string>

Can someone give me some suggestions? Thanks.


2 Answers


Your SELECT * query emits the serialized form of the Hive map, which isn't guaranteed to be JSON. Consider using the Brickhouse Hive UDFs; in particular, the to_json function is a good fit for guaranteeing JSON output.

  • to_json -- Convert an arbitrary Hive structure (list, map, named_struct) into JSON

INSERT OVERWRITE TABLE s3TableName SELECT to_json(item)
FROM hiveTableName;
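
Note that to_json is not a built-in Hive function, so the Brickhouse jar has to be added to the session and the function registered before the query above will run. A minimal sketch, assuming Brickhouse has been built or downloaded locally (the jar path and version below are placeholders):

-- Register the Brickhouse jar and bind its to_json implementation.
-- The jar path/version are placeholders for your local copy.
ADD JAR /path/to/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION to_json AS 'brickhouse.udf.json.ToJsonUDF';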

On November 9, 2020, DynamoDB released a new feature to export your data to an S3 bucket - you can read more about it here:

https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/

It's a native, serverless solution, and currently (as of November 2020) it supports DynamoDB JSON.
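
For reference, a minimal sketch of driving the export from the AWS CLI, assuming point-in-time recovery is enabled on the table and using placeholder account/bucket values:

# Point-in-time recovery (PITR) must be enabled before the table can be exported.
aws dynamodb update-continuous-backups \
    --table-name test_table \
    --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

# Export the table to S3 in DynamoDB JSON format.
# The account ID and bucket name below are placeholders.
aws dynamodb export-table-to-point-in-time \
    --table-arn arn:aws:dynamodb:us-west-2:123456789012:table/test_table \
    --s3-bucket my-export-bucket \
    --export-format DYNAMODB_JSON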