I am trying to extract data from the JSON format below in Pig using JsonLoader():

{"Partition":"10","Key":"618897","Properties2":[{"K":"A","T":"String","V":"M "}, {"K":"B","T":"String","V":"N"}, {"K":"D","T":"String","V":"O"}]}
{"Partition":"11","Key":"618900","Properties2":[{"K":"A","T":"String","V":"W”"},{"K":"B","T":"String","V":"X"}, {"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}]}

Right now I can extract "Partition", "Key", and the "V" of every array object with the following code:

A= LOAD '/home/hduser/abc.jon' Using JsonLoader('Partition:chararray,Key:chararray,Properties2:{(K:chararray,T:chararray,V:chararray)},Timestamp:chararray');
B= foreach A generate $0,$1,BagToString(Properties2.V,'\t') as vl:chararray; 
store B into './Result/outPut2';

With the code above, the "Properties2" values come out by position rather than by column, which causes problems whenever the order changes or a new object appears. Please help me extract the data by column (that is, by the K values).

My output: [image]

Expected output: [image]

Thanks In Advance

1 Answer

You have two options here:

1. Use elephant-bird, which gives you a map of keys and values.

A = LOAD '/apps/pig/json_sample' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]);
B = FOREACH A GENERATE json#'Partition',json#'Key',json#'Properties2';
dump B;

This gives the following output:

(10,618897,{([T#String,K#A,V#M ]),([T#String,K#B,V#N]),([T#String,K#D,V#O])})
(11,618900,{([T#String,K#A,V#W”]),([T#String,K#B,V#X]),([T#String,K#C,V#Y]),([T#String,K#D,V#Z])})

2. Or write a custom loader that does the following:

a) It knows the correct order of the values expected for the key K.

b) It goes through each of those keys, checks whether the JSON is missing any of them, and returns an empty/null string for that position.
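Steps (a) and (b) come down to a small lookup, which can be tried in plain Java without Pig or a JSON library. The sketch below is only an illustration under stated assumptions: the class name ColumnAligner, the method align, and the fixed key order A, B, C, D are hypothetical names chosen here, and the sample values are taken from the first record in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Pig or elephant-bird.
class ColumnAligner {

    // Step (a): the column order expected for the K values (assumed to be known in advance).
    static final List<String> KEY_ORDER = Arrays.asList("A", "B", "C", "D");

    // Step (b): emit one value per expected key, using "" when the record lacks that key.
    static List<String> align(Map<String, String> valuesByKey, List<String> keyOrder) {
        List<String> row = new ArrayList<>();
        for (String key : keyOrder) {
            row.add(valuesByKey.getOrDefault(key, ""));
        }
        return row;
    }

    public static void main(String[] args) {
        // K -> V pairs of the first sample record, which has no "C" entry.
        Map<String, String> first = new HashMap<>();
        first.put("A", "M ");
        first.put("B", "N");
        first.put("D", "O");
        System.out.println(align(first, KEY_ORDER));
    }
}
```

Because the values are looked up by key rather than by position, a record that omits a key or lists its entries in a different order still produces the same columns.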

Here is the getNext() method of a CustomJsonLoader that does this:

// Assumes json-simple (org.json.simple.*) for JSONParser, JSONObject, JSONArray and ParseException,
// and that `in` (the RecordReader saved in prepareToRead()) and `tupleFactory` are fields of the loader.
@Override
public Tuple getNext() throws IOException {
    try {
        // A null tuple tells Pig that there are no more records to process.
        if (!in.nextKeyValue()) {
            return null;
        }
        Text value = (Text) in.getCurrentValue();
        if (value == null) {
            return null;
        }

        JSONObject obj;
        try {
            obj = (JSONObject) new JSONParser().parse(value.toString());
        } catch (ParseException e) {
            // Fail loudly: returning null here would silently end the input.
            throw new IOException("Malformed JSON record: " + value, e);
        }

        List<String> valueList = new ArrayList<String>();
        valueList.add((String) obj.get("Partition"));
        valueList.add((String) obj.get("Key"));

        // Index the Properties2 entries by their K value.
        Map<String, String> keyMap = new HashMap<String, String>();
        JSONArray innArr = (JSONArray) obj.get("Properties2");
        if (innArr != null) {
            for (Object innObj : innArr) {
                JSONObject jsonObj = (JSONObject) innObj;
                keyMap.put(String.valueOf(jsonObj.get("K")), String.valueOf(jsonObj.get("V")));
            }
        }

        // Emit the values in a fixed column order, with "" for any missing key.
        String[] innKeys = { "A", "B", "C", "D" };
        for (String k : innKeys) {
            String v = keyMap.get(k);
            valueList.add(v != null ? v : "");
        }
        return tupleFactory.newTuple(valueList);
    } catch (InterruptedException e) {
        // Without a return or throw here the method would not compile.
        throw new IOException(e);
    }
}

Then register the jar and run:

REGISTER udf/CustomJsonLoader.jar;
A = LOAD '/apps/pig/json_sample' USING CustomJsonLoader();
DUMP A;

Output:

(10,618897,M,N,,O)
(11,618900,W,X,Y,Z)

Hope this helps!