1
votes

I am not able to load nested JSON data into Hive table. Could someone please help me? Below is what I have tried:

Sample Input:

{"DocId":"ABC","User1":{"Id":1234,"Username":"sam1234","Name":"Sam","ShippingAddress":{"Address1":"123 Main St.","Address2":null,"City":"Durham","State":"NC"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}

On Hive (CDH3):

ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

CREATE TABLE json_tab(
    DocId string,
    user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;  

hive> select * from json_tab;
OK
NULL    null

I am getting NULLs here.

Also tried with HCatalog jar:

ADD JAR /home/training/Desktop/hcatalog-core-0.11.0.jar;

 CREATE TABLE json_tab(
    DocId string,
    user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

But facing below error with my create table statement:

FAILED: Error in metadata: Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Could someone please help me? Thanks for your help in advance.

3

3 Answers

4
votes

you can use org.openx.data.jsonserde.JsonSerDe class to rad the json data

you can download jar file from http://www.congiu.net/hive-json-serde/1.3.6-SNAPSHOT/cdh4/

and do following steps

add jar /path/to/jar/json-serde-1.3.6-jar-with-dependencies.jar;

CREATE TABLE json_tab(
    DocId string,
    user1 struct<Id: int, Username: string, Name:string,ShippingAddress:struct<address1:string,address2:string,city:string,state:string>,orders:array<struct<ItemId:int,orderdate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH  '/path/to/data/nested.json' INTO TABLE json_tab;

SELECT DocId, User1.Id, User1.ShippingAddress.City as city,
User1.Orders[0].ItemId as order0id,
User1.Orders[1].ItemId as order1id from json_tab;


result
ABC     1234    Durham  6789    4352
0
votes
 I was getting same exception.

I added following jars and it worked for me.

ADD JAR /home/cloudera/Data/json-serde-1.3.7.3.jar;
ADD JAR /home/cloudera/Data/hive-hcatalog-core-0.13.0.jar;
0
votes

Using HiveQL to analyse JSON files require either org.openx.data.jsonserde.JsonSerDe or org.apache.hive.hcatalog.data.JsonSerDe to work correctly.

org.apache.hive.hcatalog.data.JsonSerDe
This is the default JSON SerDe from Apache. This is commonly used to process JSON data like events. These events are represented as blocks of JSON-encoded text separated by a new line. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.

org.openx.data.jsonserde.JsonSerDe
OpenX JSON SerDe is similar to native Apache; however, it offers multiple optional properties such as "ignore.malformed.json", "case.insensitive", and many more. In my opinion, it usually works better when dealing with nested JSON files.

See the working example below:

CREATE EXTERNAL TABLE IF NOT EXISTS `dbname`.`tablename` (
    `DocId` STRING,
    `User1` STRUCT<
    `Id`:INT,
`Username`:STRING,
`Name`:STRING,
`ShippingAddress`:STRUCT<
    `Address1`:STRING,
`Address2`:,
`City`:STRING,
`State`:STRING>,
`Orders`:STRUCT<
    `ItemId`:INT,
`OrderDate`:STRING>>)
ROW FORMAT SERDE 
     'org.openx.data.jsonserde.JsonSerDe'
LOCATION
     's3://awsexamplebucket1-logs/AWSLogs/'

Create table statement generated from: https://www.hivetablegenerator.com/