Parsing JSON arrays (escaped strings)

Question

I'm a total Hive noob, trying to parse escaped strings of JSON arrays with a Hive query on EMR. Each line returned from my S3 bucket looks like this:

["{\"code\": 14468278120, \"book_id\": \"rUkt3UHYyumTy7KCr7T72y\", \"description\": \”fantasyl\"}", "{\"code\": 14468304270, \"book_id\": \"wyyoenAWvmg3HhpZqVQr8i\", \"description\": \”sci fi\”}”, "{\"code\": 14468310080, \"book_id\": \"4UyygAduLhs3qyHTyJLZiD\", \"description\": \”history\”}”, "{\"code\": 14468313170, \"book_id\": \"6KPnPiTBViZJeG3eguUZig\", \"description\": \”architecture\”}”, "{\"code\": 14468313650, \"book_id\": \"U5zAirv3HYvGYANkmb64Yy\", \"description\": \”design\”}”]

So far, any attempts with Regex Serde and JSON SerDe to get anything out of these strings are returning NULL for each row.

e.g. this returns all NULL

CREATE EXTERNAL TABLE kinesis_flagged (value STRING) 
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='value') 
LOCATION "s3n://S3_BUCKET_PATH";

Other embarrassing attempts include:

CREATE EXTERNAL TABLE kinesis_flagged (value STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 'input.regex'='\\{\\"code\\": (.+?), \\"book_id\\": \\"(.+?)\\", \\"description\\": \\"(.+?)\\"\\}"' ) 
LOCATION "s3n://S3_BUCKET_PATH";

Any ideas on how I should be doing this? Furthermore, how do I create a table with the columns "code", "book_id", and "description"?

There are some hairy stuffs with Hive's string literal right now. Upgrade to Hive 2.0.0, or check whether your Hive distribution includes the fix to HIVE-11723. Anyway, parsing JSON in Hive is a bad idea. You might want to find JSON Serde library instead. — nhahtdh

john elemans john elemans · Accepted Answer · 2015-11-11T03:42:58

You are not likely to have success with regex on JSON. You should break this down into two tasks;

1 - parse the JSON with a library 2 - build your tables as needed

The sample looks like normal JSON except for some spurious double-quotes. Solve that first otherwise your JSON parsers will barf.

Parsing JSON arrays (escaped strings)

1 Answers