
I have arrays of JSON objects like the ones below. Each array, delimited by [ and ], is on a single line.

[{"event":0,"properties":{"color":"red","connectionType":2}},{"event":3,"properties":{"color":"blue","connectionType":4}},{"event":45,"properties":{"color":"green","connectionType":3}}]
[{"event":0,"properties":{"color":"red","connectionType":5}},{"event":1,"properties":{"color":"blue","connectionType":6}}]

Here it is in an easier-to-read format.

[
    {"event":0, "properties":{"color":"red","connectionType":2}},
    {"event":3, "properties":{"color":"blue","connectionType":4}},
    {"event":45, "properties":{"color":"green","connectionType":3}}
]
[
    {"event":0, "properties":{"color":"red","connectionType":5}},
    {"event":1, "properties":{"color":"blue","connectionType":6}}
]

A few things to note: each array, with all of the JSON objects inside its [ ], is on a single line. The number of objects on each line varies, and so does the number of fields inside properties.

What I want to do with this data is take each JSON object and convert it to a tabular format, as comma-separated or tab-separated values:

| event | color | connectionType |
| 0     | red   | 2              |
| 3     | blue  | 4              |

I've looked at a few tools that Pig uses to read JSON structures, namely elephant-bird, but I can't quite get them to work on my data.

I'm hoping to get pointers to alternative solutions, or example code using elephant-bird or other Pig JSON parsers. My end goal is really just to capture a subset of events and properties and load them into Hive.

1 Answer

Your JSON file has no enclosing start object, so the loader cannot differentiate between rows. I found a solution, but it requires wrapping your JSON array in a start object:

{"startObject":[{"event":0, "properties":{"color":"red","connectionType":2}},{"event":3, "properties":{"color":"blue","connectionType":4}},{"event":45, "properties":{"color":"green","connectionType":3}}]}

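-- Load the wrapped JSON; the schema describes a bag of (event, properties) tuples.
A = LOAD '/home/kishore/Data/Pig/pig.json' USING JsonLoader('{(event:chararray,properties: (color:chararray,connectionType:chararray))}');
-- Unnest the bag so that each event becomes its own row.
B = FOREACH A GENERATE FLATTEN($0);
-- Unnest the properties tuple into separate columns.
C = FOREACH B GENERATE $0, FLATTEN($1);
DUMP C;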

Result

(0,red,2)
(3,blue,4)
(45,green,3)
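
The start object used above does not have to be added by hand. Here is a minimal preprocessing sketch in Python, assuming every input line holds one valid JSON array. The function names `wrap_line` and `wrap_lines` are hypothetical, and the key `startObject` is simply the one from my example:

```python
import json

def wrap_line(line, key="startObject"):
    """Wrap one JSON array (one input line) in an object under `key`."""
    records = json.loads(line)  # raises ValueError on malformed JSON
    if not isinstance(records, list):
        raise ValueError("expected a JSON array on each line")
    # Compact separators keep each wrapped object on a single line.
    return json.dumps({key: records}, separators=(",", ":"))

def wrap_lines(lines, key="startObject"):
    """Wrap every non-blank line; blank lines are skipped."""
    return [wrap_line(line, key) for line in lines if line.strip()]
```

Writing the returned strings to a new file, one per line, should give input in the same shape as the wrapped example above.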

If you want to parse your JSON without adding a start object, you will need to write your own custom UDF. See https://gist.github.com/kimsterv/601331

Alternatively, use elephant-bird: https://github.com/twitter/elephant-bird
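
To give an idea of what custom parsing would have to do, here is a rough sketch in plain Python rather than an actual Pig UDF. It assumes one JSON array per line and that only a chosen subset of properties is wanted. The names `flatten_line` and `to_tsv` and the default `fields` tuple are made up for illustration:

```python
import json

def flatten_line(line, fields=("color", "connectionType")):
    """Turn one JSON array line into rows of [event, <selected properties>]."""
    rows = []
    for record in json.loads(line):
        props = record.get("properties", {})
        # Properties vary per record, so missing ones become empty strings.
        rows.append([record.get("event", "")] + [props.get(f, "") for f in fields])
    return rows

def to_tsv(rows):
    """Join rows into tab-separated text, one record per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows)
```

The tab-separated output could then probably be loaded into Hive as an ordinary delimited text table, which may be enough if the goal is just to capture a subset of events and properties.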