I'm a total Hive noob, trying to parse escaped strings of JSON arrays with a Hive query on EMR. Each line returned from my S3 bucket looks like this:
["{\"code\": 14468278120, \"book_id\": \"rUkt3UHYyumTy7KCr7T72y\", \"description\": \”fantasyl\"}", "{\"code\": 14468304270, \"book_id\": \"wyyoenAWvmg3HhpZqVQr8i\", \"description\": \”sci fi\”}”, "{\"code\": 14468310080, \"book_id\": \"4UyygAduLhs3qyHTyJLZiD\", \"description\": \”history\”}”, "{\"code\": 14468313170, \"book_id\": \"6KPnPiTBViZJeG3eguUZig\", \"description\": \”architecture\”}”, "{\"code\": 14468313650, \"book_id\": \"U5zAirv3HYvGYANkmb64Yy\", \"description\": \”design\”}”]
So far, any attempts with Regex Serde and JSON SerDe to get anything out of these strings are returning NULL for each row.
e.g. this returns all NULL
CREATE EXTERNAL TABLE kinesis_flagged (value STRING)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='value')
LOCATION "s3n://S3_BUCKET_PATH";
Other embarrassing attempts include:
CREATE EXTERNAL TABLE kinesis_flagged (value STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ( 'input.regex'='\\{\\"code\\": (.+?), \\"book_id\\": \\"(.+?)\\", \\"description\\": \\"(.+?)\\"\\}"' )
LOCATION "s3n://S3_BUCKET_PATH";
Any ideas on how I should be doing this? Furthermore, how do I create a table with the columns "code", "book_id", and "description"?