
I want to create one flat Hive table from JSON data that is coming from another Hive table (stored in one column, event_data). Below is the JSON data structure. I have created the Hive table using LATERAL VIEW, but now I want to use PySpark with some UDFs to create the Hive table.

'{"callId":"0000000","journey":{"channel":{"out":"[email protected]","outbound":"EMAIL"}},"application":{"componentId":"23456","name":"dfgt-ghy-svc","applicationReferenceId":"SRFC98756RD"},"servicingDetail":{"offerAttributes":{"id":"ADLC0110000"},"offerCommunicationAttributes":{"id":"CFRGTV10098","status":"SUCCESS"}},"customerInfo":{"calledInAccount":"ERFCVDG9801"},"correlationId":"9845-sd76-sdfr87","fulfiller":{"id":"DEFC1009","category":"TST","entity":"colleague"},"platform":{"name":"v-generation","id":"37664859"}}'

I need to extract only 4 columns: callId, correlationId, servicingDetail-offerAttributes-id, and fulfiller-id.

Please help me to do the same.


1 Answer


First, create OutputTable in Hive.

Then select the data from the single-column table using the get_json_object() function and insert it into the final table. Refer to the Hive documentation for get_json_object().

INSERT INTO TABLE OutputTable 
SELECT 
    get_json_object(event_data,'$.callId') as callId, 
    get_json_object(event_data,'$.correlationId') as correlationId,
    get_json_object(event_data,'$.servicingDetail.offerAttributes.id') as servicingDetail_offerAttributes_id,
    get_json_object(event_data,'$.fulfiller.id') as fulfiller_id
FROM SingleColumnTable;
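For reference, those four JSON paths resolve against the question's sample record like this (a plain-Python sketch using the standard json module, with the sample trimmed to the relevant fields; get_json_object follows the same dotted-path semantics):

```python
import json

# Sample record from the question, trimmed to the fields being extracted
sample = (
    '{"callId":"0000000",'
    '"servicingDetail":{"offerAttributes":{"id":"ADLC0110000"}},'
    '"correlationId":"9845-sd76-sdfr87",'
    '"fulfiller":{"id":"DEFC1009"}}'
)
record = json.loads(sample)

# Same dotted paths that get_json_object('$...') walks
print(record["callId"])                                    # 0000000
print(record["correlationId"])                             # 9845-sd76-sdfr87
print(record["servicingDetail"]["offerAttributes"]["id"])  # ADLC0110000
print(record["fulfiller"]["id"])                           # DEFC1009
```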

1st edit - Solution for pyspark

Read the single-column table into a dataframe (suppose the dataframe is dfSingleColumnedData), then apply the logic below to turn each JSON record into columns. Finally, select your required columns.

import json
from collections import OrderedDict
from pyspark.sql import Row

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has the same field order
    return Row(**OrderedDict(sorted(d.items())))

# Parse the JSON string in each row, then convert each dict to a Row
dfSingleColumnedData.rdd \
    .map(lambda r: json.loads(r.event_data)) \
    .map(convert_to_row) \
    .toDF()