I have a multi-level XML file and could not find any example of how to load it in Pig.

XML file:

   <?xml version="1.0" encoding="UTF-8" ?>
        <Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0"
              name="xx"
              incremental="false"
              extractDate="2014-04-22T11:00:00.000000"><Categories><Category>  <ExternalId>2_5</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Baby]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl></Category><Category><ExternalId>2_3</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Boys 1½-12yrs]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl></Category></Categories>
              <Products><Product><ExternalId>78094</ExternalId><Name><![CDATA[Sleep Bag]]></Name><Description><![CDATA[A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.]]></Description><Brand>xx</Brand><CategoryExternalId>1_5_1</CategoryExternalId><ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl><ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl><SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl><Price>54.0000</Price><Wasprice>54.0000</Wasprice><ManufacturerPartNumber></ManufacturerPartNumber><EAN></EAN><Colours><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>03 06</Tier3><Tier3Descr><![CDATA[3-6m]]></Tier3Descr><StockStatus>-2</StockStatus><SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl></Variation><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>06 18</Tier3><Tier3Descr><![CDATA[6-18m]]></Tier3Descr>  <StockStatus>-2</StockStatus>   <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>        </Variation></Colours></Product>
              </Products>
        </Feed>

I tried the script below, but it returns empty rows. I also need the Products, not only the Categories.

REGISTER 'lib/pig/piggybank.jar'

-- load raw

raw = load '$Input' using org.apache.pig.piggybank.storage.XMLLoader('Category') 
    as (x:chararray);

raw_flatten = foreach raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
    '<Category>\\n\\s*<ExternalId>(.*)</ExternalId>\\n\\s*<ParentExternalId>(.*)</ParentExternalId>\\n\\s*<Name>(.*)</Name>\\n\\s*<CategoryPageUrl>(.*)</CategoryPageUrl>\\n\\s*</Category>'))
    as (external_id:chararray, parent_external_id:chararray, name:chararray, categorypageurl:chararray);

How can I load the XML above?

Thanks in advance.

Update: if I put a line break after each field, I can read the data. How can I solve this without reformatting the file? Other tools do not need line breaks, and I cannot alter the source data.

Formatted XML:

<?xml version="1.0" encoding="UTF-8" ?>
<Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0"
              name="xx"
              incremental="false"
              extractDate="2014-04-22T11:00:00.000000">
 <Categories>
  <Category>
   <ExternalId>2_5</ExternalId>
   <ParentExternalId></ParentExternalId>
   <Name>Baby</Name>
   <CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl>
  </Category>
  <Category>
   <ExternalId>2_3</ExternalId>
   <ParentExternalId></ParentExternalId>
   <Name>Boys 1½-12yrs</Name>
   <CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl>
  </Category>
 </Categories>
 <Products>
  <Product>
   <ExternalId>78094</ExternalId>
   <Name>Sleep Bag</Name>
   <Description>A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.</Description>
   <Brand>xx</Brand>
   <CategoryExternalId>1_5_1</CategoryExternalId>
   <ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl>
   <ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl>
   <SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl>
   <Price>54.0000</Price>
   <Wasprice>54.0000</Wasprice>
   <ManufacturerPartNumber></ManufacturerPartNumber>
   <EAN></EAN>
   <Colours>
    <Variation>
     <Tier2>MUL</Tier2>
     <Tier2Descr>Multi Elephant Party</Tier2Descr>
     <Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url>
     <Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl>
     <Tier3>03 06</Tier3>
     <Tier3Descr>3-6m</Tier3Descr>
     <StockStatus>-2</StockStatus>
     <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>
    </Variation>
    <Variation>
     <Tier2>MUL</Tier2>
     <Tier2Descr>Multi Elephant Party</Tier2Descr>
     <Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url>
     <Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl>
     <Tier3>06 18</Tier3>
     <Tier3Descr>6-18m</Tier3Descr>
     <StockStatus>-2</StockStatus>
     <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>
    </Variation>
   </Colours>
  </Product>
 </Products>
</Feed>
I was able to format the XML and can now read the categories, but I cannot read the products because they contain embedded variations. How can I load this XML? - clairvoyant

1 Answer


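Your regex requires a newline character between the tags:

\\n\\s*

Change each of these to [\\n\\s]* (or simply \\s*, since \\s already matches newlines) and the pattern should also match when the tags sit on a single line.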
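
For reference, here is a minimal sketch of the category load from the question with the newline requirement dropped. It assumes XMLLoader('Category') hands each <Category>...</Category> element to the script as one chararray, as in the original script. The (?s) flag and the lazy (.*?) groups are additions of mine rather than part of the original pattern; they are there so that a capture may span a line break and stops at the first closing tag.

```pig
REGISTER 'lib/pig/piggybank.jar';

-- One record per <Category> element, as in the question.
raw = LOAD '$Input' USING org.apache.pig.piggybank.storage.XMLLoader('Category')
    AS (x:chararray);

-- \\s* accepts zero or more whitespace characters (spaces, tabs, newlines),
-- so the same pattern covers both the single-line and the formatted file.
-- (?s) and (.*?) are assumptions added for this sketch, see the note above.
categories = FOREACH raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
    '(?s)\\s*<Category>\\s*<ExternalId>(.*?)</ExternalId>\\s*<ParentExternalId>(.*?)</ParentExternalId>\\s*<Name>(.*?)</Name>\\s*<CategoryPageUrl>(.*?)</CategoryPageUrl>\\s*</Category>\\s*'))
    AS (external_id:chararray, parent_external_id:chararray, name:chararray, categorypageurl:chararray);

DUMP categories;
```

Note that with the unformatted source file the name field will still carry the CDATA wrapper (for example <![CDATA[Baby]]>), so it may need to be stripped in a later step. This sketch only covers Categories; the nested Variation elements inside Product are not handled here.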