
I'm working on a project for a big data class, and I've installed the Cloudera Quickstart VM locally to run some basic tasks on my dataset and get familiar with some of the tools. I was following a tutorial which involved moving the dataset into HDFS, creating an HCatalog table based on the dataset file, then running Hive and/or Pig commands on the table. The problem is that my data is a large XML file, and the standard delimiter options in HCatalog do not apply.

Is there a way to import XML into HCatalog? If not, what is the best way to use Hive or Pig on my XML dataset?

EDIT: My file is from the public StackOverflow dataset. I am using the posts.xml file. It's quite large (25GB) and I'm having trouble opening it on my machine, but below is the structure according to the Readme file:

- **posts**.xml
   - Id
   - PostTypeId
      - 1: Question
      - 2: Answer
   - ParentID (only present if PostTypeId is 2)
   - AcceptedAnswerId (only present if PostTypeId is 1)
   - CreationDate
   - Score
   - ViewCount
   - Body
   - OwnerUserId
   - LastEditorUserId
   - LastEditorDisplayName (e.g. "Jeff Atwood")
   - LastEditDate (e.g. "2009-03-05T22:28:34.823")
   - LastActivityDate (e.g. "2009-03-11T12:51:01.480")
   - CommunityOwnedDate (e.g. "2009-03-11T12:51:01.480")
   - ClosedDate (e.g. "2009-03-11T12:51:01.480")
   - Title
   - Tags
   - AnswerCount
   - CommentCount
   - FavoriteCount

Will the sheer size of this file be a problem in the VM? In the end we will be repeating some of these ETL tasks in AWS, but for now I am trying to avoid racking up a large bill without knowing how to properly use some of the tools.

**Comments:**

Regarding the data size on the VM: that depends on the size of your VM and your storage type. I have that VM running with 8 cores and 24GB of RAM at home, with SSD storage. Even with that, I would probably run only a subset of the data (maybe 1-3%). You're trying to learn, not performance-test. If it takes 2 hours to run a test job before you see the output, it slows down the learning process. – JamCon

@JamCon Our project requires a large dataset such as this, but what I'm doing now is learning how to use the tools, then doing the "real" ETL on the "real" dataset in AWS. I guess I'll try to use a smaller dataset for testing/learning purposes. – Amru E.
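The subset approach suggested in the comments can be sketched in plain Python: stream just the first few thousand `<row>` elements out of posts.xml rather than loading all 25GB at once. This is a rough sketch, not part of the original question; the helper name `sample_rows` and the default limit are illustrative.

```python
import xml.etree.ElementTree as ET

def sample_rows(path, limit=1000):
    """Stream the first `limit` <row> elements from a StackOverflow dump
    file without ever loading the whole document into memory."""
    rows = []
    # iterparse yields each element as its closing tag is seen, so memory
    # use stays flat as long as we clear elements after reading them
    for _, elem in ET.iterparse(path):
        if elem.tag == "row":
            rows.append(dict(elem.attrib))
            elem.clear()
            if len(rows) >= limit:
                break
    return rows
```

The resulting list of attribute dicts can be written back out as a small test file for the VM, leaving the full dump for the AWS run.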

1 Answer


XML uses a fairly standardized structure, so I would be interested in seeing your data format and what delimiter isn't working.

Without knowing more about the data and its structure, this is probably what I would do:

  1. Decide on my schema and create the HCatalog table manually (or scripted, whichever is easiest).
  2. Load the data via Pig, using the piggybank XMLLoader.
  3. Parse the data using regex into the schema I decided on for the HCatalog table.
  4. Store it using the HCatStorer method.



-- Example code

REGISTER piggybank.jar

-- Load each <item>...</item> block from the file as a single chararray
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull the individual fields out of each item with regexes
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

-- Write the parsed records into the HCatalog table
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

-- Read the table back to validate the load
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;



-- Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
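One difference for posts.xml: judging from the attribute-style values in the Readme excerpt above, each post is a single self-closing `<row ... />` element whose fields are `Name="value"` attributes, so the extraction patterns would target attributes rather than nested tags like `<link>`. A rough Python sketch of that attribute-style extraction (the helper name and sample line are illustrative, not from the answer):

```python
import re

# Fields in posts.xml live in Name="value" attributes on a single
# self-closing <row ... /> element, one post per row
ROW_ATTR = re.compile(r'(\w+)="([^"]*)"')

def parse_row(line):
    """Return the attribute key/value pairs of one <row ... /> line,
    or None if the line is not a row element."""
    if "<row" not in line:
        return None
    return dict(ROW_ATTR.findall(line))

sample = '<row Id="4" PostTypeId="1" AcceptedAnswerId="7" Score="358" />'
print(parse_row(sample)["Score"])  # prints 358 (as a string; cast as needed)
```

The same idea carries over to the Pig version: the `REGEX_EXTRACT` patterns would match `AttributeName="(...)"` instead of `<tag>(.*)</tag>`.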



-- rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>[email protected]</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>