
I have a CSV file in the following format:

customerid, period, credit, debit
 100, jan-2017, 500, 300
 100, jan-2017, 300,0
 100, feb-2017, 200,100
 100, mar-2017, 200,10
 200, jan-2017, 100, 200
 200, feb-2017,100,200

My requirement is to group the rows by customer ID, then by period, consolidate the transactions, and produce hierarchical JSON like the following using an Apache Pig script:

    [{
        "customerid": 100,
        "periods": [{
            "period": "jan-2017",
            "transactions": [{"credit": 500, "debit": 300}, ....]
        }, {
            "period": "feb-2017",
            "transactions": [...]
        }, {
            "period": "mar-2017",
            "transactions": [....]
        }]
    }, {
        "customerid": 200,
        "periods": [{
            "period": "jan-2017",
            "transactions": [.....]
        }, {
            "period": "feb-2017",
            "transactions": [.....]
        }]
    }]

I am fairly new to Pig, but I managed to write the following script:

Data = LOAD 'data.csv' USING PigStorage(',') AS (

CompanyBag = GROUP Data BY (company_id);

final_trsnactionjson = FOREACH CompanyBag {
    ByCompanyId = FOREACH Data {
        PeriodBag = GROUP Data BY (period);

        IdPeriodItemRoot = FOREACH PeriodBag{
            ItemRecords = FOREACH Source GENERATE debit as debit, credit as credit
            GENERATE group as period, TOTUPLE(ItemRecords) as transactions;
    GENERATE group as customerid, TOTUPLE(PeriodBag) AS periods;

But this gives me the following error:

mismatched input '{' expecting GENERATE

I searched extensively for how to generate nested JSON with Pig but could not find any good pointers. Where am I going wrong? Thanks in advance for the help.


1 Answer

  1. Use the JsonLoader available in Pig:

https://pig.apache.org/docs/r0.11.1/func.html#jsonloadstore

You can provide a nested schema in the "AS" clause. The same page also documents JsonStorage, which is the counterpart for writing JSON.

  2. Use com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') for simpler handling of nested JSON arrays.
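
As for the error in the question: as far as I know, a nested FOREACH block only accepts a small set of operators (DISTINCT, FILTER, LIMIT, ORDER BY and, in newer releases, CROSS and a plain FOREACH ... GENERATE). GROUP is not among them, and an inner FOREACH cannot open its own { } block, which would explain "mismatched input '{' expecting GENERATE". One way around it is to do the two groupings as separate top-level statements and write the result with JsonStorage. What follows is a minimal sketch, not a tested solution. It assumes the header row has been removed from data.csv, that the fields carry no padding spaces, and that all four columns can be typed as shown; the relation names and the 'output' path are made up for illustration.

```pig
-- Assumed schema, taken from the CSV header in the question.
Data = LOAD 'data.csv' USING PigStorage(',')
       AS (customerid:int, period:chararray, credit:int, debit:int);

-- Level 1: one bag of (credit, debit) tuples per (customerid, period).
ByPeriod = GROUP Data BY (customerid, period);
Periods  = FOREACH ByPeriod GENERATE
               group.customerid     AS customerid,
               group.period         AS period,
               Data.(credit, debit) AS transactions;

-- Level 2: one bag of (period, transactions) tuples per customer.
ByCustomer = GROUP Periods BY customerid;
Result     = FOREACH ByCustomer GENERATE
                 group                          AS customerid,
                 Periods.(period, transactions) AS periods;

-- JsonStorage writes one JSON object per output record.
STORE Result INTO 'output' USING JsonStorage();
```

Note that JsonStorage emits one JSON object per line rather than a single enclosing array, so if a consumer needs the top-level [ ... ] it would have to be added in a later step.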