
Is there a problem with the Hive script below, or is this another issue, possibly related to the version of Hive installed by AWS Data Pipeline?

The first part of my AWS Data Pipeline must export large tables from DynamoDB to S3 to later process using EMR. The DynamoDB table that I'm using for testing is only a few rows long, so I know that the data is formatted correctly.

The script associated with the AWS Data Pipeline "Export DynamoDB to S3" building block works correctly for tables that contain only primitive types, but it doesn't export array types. (reference - http://archive.cloudera.com/cdh/3/hive/language_manual/data-manipulation-statements.html)

I pulled out all the Data Pipeline-specific stuff and am now trying to get the following minimal example, based on the DynamoDB docs, to work - (reference - http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html)

-- Drop table
DROP TABLE IF EXISTS dynamodb_table;

--http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html
CREATE EXTERNAL TABLE dynamodb_table (song string, artist string, id string, genres array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "song:song,artist:artist,id:id,genres:genres");

INSERT OVERWRITE DIRECTORY 's3://umami-dev/output/colmap/' SELECT *
FROM dynamodb_table;

Here is the stack trace / EMR error that I see when running the above script -

Diagnostic Messages for this Task:
java.io.IOException: IO error in map input file hdfs://172.31.40.150:9000/mnt/hive_0110/warehouse/dynamodb_table
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:244)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:218)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:238)
... 9 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.scan(AbstractDynamoDBRecordReader.java:176)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.fetchItems(HiveDynamoDBRecordReader.java:87)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:44)
at org.apache.hadoop.hive.dynamodb.read.HiveDynamoDBRecordReader.next(HiveDynamoDBRecordReader.java:25)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Command exiting with ret '255'

I've already tried a few things to debug - creating an external table with explicit formatting, and using a few different JSON SerDes - but none of them have been successful. I'm not sure what to try next.

Many thanks.

Have you tried this with EMR AMI 3.3.2 or later? – ChristopherB
Yep, I tried that, but still no luck. Thx. With 3.3.2 I am getting rows with string fields, but no array<string> fields. Three rows of output look like this - 2The CureInbetween Days\N\n 1The CureBoys Dont Cry\N\n 3Depeche ModePimpf\N\n The "\N" is actually in the output and the "\n"s are real line-breaks. – Elliot

1 Answer


I answered my own question by creating an EMR cluster and using Hue to quickly run Hive queries in the Amazon environment.

The solution was to change the format of the items in DynamoDB - what was originally a List of Strings is now a StringSet. My Hive tables could then successfully operate on the array.

Logically speaking, I may lose the order of the strings: a DynamoDB List is ordered, but a Set is not. This doesn't matter for my specific problem.
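In case it helps anyone else, here is a rough sketch of how that List-to-StringSet rewrite could be scripted with boto3. Treat it as an illustration, not part of my actual pipeline - the table name, key, and attribute names ("InputDB", "id", "genres", "genres2") just match my example, and it assumes no item carries an empty genres list (DynamoDB doesn't allow empty sets) -

# Sketch: copy a List-of-Strings attribute into a new StringSet attribute.
import boto3

table = boto3.resource("dynamodb").Table("InputDB")

# A plain scan is fine for a small test table; paginate for real data.
for item in table.scan()["Items"]:
    genres = item.get("genres")
    if not genres:
        continue  # skip items without genres (empty sets are not allowed)
    table.update_item(
        Key={"id": item["id"]},
        UpdateExpression="SET genres2 = :ss",
        # The boto3 resource API serializes a Python set of strings
        # as a DynamoDB StringSet (SS).
        ExpressionAttributeValues={":ss": set(genres)},
    )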

Here's the relevant chunk of the final functioning Hive script -

-- depends on genres2 being a StringSet (or not existing)
CREATE EXTERNAL TABLE sauce (id string, artist string, song string, genres2 array<string>)
STORED BY "org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler"
TBLPROPERTIES ("dynamodb.table.name" = "InputDB",
"dynamodb.column.mapping" = "id:id,artist:artist,song:song,genres2:genres2");

-- S3 location to export to
CREATE EXTERNAL TABLE pasta (id int, artist string, song string, genres array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '|'
LOCATION "s3n://umami-dev/tmp2";

-- do the export
INSERT OVERWRITE TABLE pasta
SELECT * FROM sauce;
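
For reference, with those delimiters each exported row in s3n://umami-dev/tmp2 comes out as one tab-separated line, with the array values pipe-separated inside their field - so a row would look something like this (genre values hypothetical):

1	The Cure	Boys Dont Cry	new wave|post-punk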