0
votes

I need to parse an XML file to extract certain fields, perform operations on those fields, and generate a CSV from the result.

I looked at the XMLLoader available in Pig, but it seems to return the XML tags as well. I am only interested in the data inside the tags. Is there any way to achieve this? I also need to write that data out as a CSV.

Any working samples would be of great help.

2
How nested / hierarchical is your XML? Can you post an example? - Chris White

2 Answers

0
votes

You could use REGEX_EXTRACT() to get the information out of the tags, and possibly SUBSTRING() and REGEX_EXTRACT_ALL() as well.
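As a rough sketch of the REGEX_EXTRACT() approach, the script below pulls the text out of two child tags. The input file `items.txt`, the relation names, and the tag names `name` and `price` are all hypothetical, and it assumes each input line holds one complete XML element with no embedded newlines.

```pig
-- Hypothetical input: one XML element per line, for example
-- <item><name>pen</name><price>10</price></item>
records = LOAD 'items.txt' USING TextLoader() AS (x:chararray);

-- REGEX_EXTRACT(string, regex, index) returns the capture group at `index`.
-- ([^<]*) captures everything up to the next '<', i.e. the tag's text content.
-- The surrounding .* lets the pattern cover the whole line.
fields = FOREACH records GENERATE
    REGEX_EXTRACT(x, '.*<name>([^<]*)</name>.*', 1) AS name,
    REGEX_EXTRACT(x, '.*<price>([^<]*)</price>.*', 1) AS price;

DUMP fields;
```

This works best for flat XML with a known set of tags; deeply nested or repeated elements are hard to handle with regular expressions alone.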

0
votes

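The piggybank jar provides an XML loader for Pig.

In the load statement, use XMLLoader and pass it the name of the parent tag that wraps each record. Every occurrence of that element is returned as one chararray, tags included, so the tag name must match your XML exactly.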

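-- Register piggybank so XMLLoader can be resolved (adjust the jar path for your installation)
REGISTER /path/to/piggybank.jar;
A = load '/path of the file' using org.apache.pig.piggybank.storage.XMLLoader('parent_tag') as (x:chararray);
B = foreach A generate REPLACE(x,'[\\n]','') as x;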

After this, use REGEX_EXTRACT_ALL to extract the data between the tags. Each ([^<]*) group captures the text of one child tag, and the result is a single tuple of the captured values:

C = foreach B generate REGEX_EXTRACT_ALL(x,'.*(?:<child_tag1>)([^<]*).*(?:<child_tag2>)([^<]*).*');
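To get from the extracted tuple to a CSV file, one possible continuation is sketched below. It assumes `C` is the relation produced by the REGEX_EXTRACT_ALL statement above; the field names `field1` and `field2` and the output path are placeholders.

```pig
-- C holds one tuple of captured groups per record (from the statement above).
-- FLATTEN promotes the tuple's elements to top-level fields.
D = foreach C generate FLATTEN($0) as (field1:chararray, field2:chararray);

-- PigStorage(',') writes one comma-separated line per record.
store D into '/path of the output' using PigStorage(',');
```

PigStorage does not quote values, so fields that themselves contain commas will break the column layout. If that is a concern, piggybank's CSVExcelStorage may be a better fit.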

For more details, see the link below:

https://acadgild.com/blog/converting-xml-into-csv-using-pig/