0
votes

I'm trying to import data using an EMR job from JSON files in S3 that contain sparse fields e.g. an ios_os field and android_os but only one contains data. Sometimes the data is null and sometimes it's an empty string, when trying to insert into DynamoDB I'm getting an error (although I am able to insert some records that are sparsely populated):

"AttributeValue may not contain an empty string" {"created_at_timestamp":1358122714,...,"data":null,"type":"e","android_network_carrier":""}

I filtered out the columns that had the empty string "", but I'm still getting that error. I'm assuming it's the "property":null values that are causing this (or both). I assume that for it to work properly those values shouldn't exist when going to DynamoDB?

Is there any way to tell Hive through the JSONSerde or Hive's interaction with the DynamoDB table to ignore empty string attribute values.

Here's an example of the Hive SQL schema and insert command:

CREATE EXTERNAL TABLE IF NOT EXISTS json_events (
  -- Common
  created_at BIGINT,
  data STRING,
  type STRING,
  android_network_carrier STRING
)
PARTITIONED BY (created_at BIGINT, type STRING)
ROW FORMAT SERDE "org.apache.hadoop.hive.contrib.serde2.JsonSerde"
WITH SERDEPROPERTIES (
  -- Common
  "created_at"="$.created_at",
  "data"="$.data",
  "android_network_carrier"="$.anw",
  "type"="$.dt"
)
LOCATION s3://test.data/json_events;

CREATE EXTERNAL TABLE IF NOT EXISTS dynamo_events (
  -- Common
  created_at BIGINT,
  data STRING,
  type STRING,
  android_network_carrier STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "test-events", 
"dynamodb.column.mapping" = "created_at:created_at,data:data,type:type,android_network_carrier:android_network_carrier"); 
ALTER TABLE json_events RECOVER PARTITIONS;

INSERT OVERWRITE TABLE dynamo_events
SELECT created_at,
  data,
  android_network_carrier,
  type
FROM json_events 
WHERE created_at = 20130114 AND type = 'e';
1
Can you share your table definition and your query? I think I know what it is but need to see what you're trying to do. - Charles Menguy
I added a sample Hive SQL schema and insert statement that should represent what I'm trying to do. - Dougnukem

1 Answers

1
votes

The nulls shouldn't be a problem as long as it's not for the primary key.

However, DynamoDB does not allow empty strings nor empty sets as described in the data model.

To work around this, I think you have a couple options:

  1. Define a constant for empty strings like "n/a", and make sure that your data extraction processes treats missing values as such.
  2. You could also filter these records, but that will mean losing data. This could be done like this:

    INSERT OVERWRITE TABLE dynamo_events
    SELECT created_at,
      data,
      android_network_carrier,
      type
    FROM json_events 
    WHERE created_at = 20130114 AND type = 'e' AND android_network_carrier != "";