I'm working on a project for a big data class, and I've installed the Cloudera Quickstart VM locally to run some basic tasks on my dataset and get familiar with the tools. I was following a tutorial that involved moving the dataset into HDFS, creating an HCatalog table based on the dataset file, and then running Hive and/or Pig commands on the table. The problem is that my data is a large XML file, and the standard delimiter options in HCatalog do not apply.
Is there a way to import XML into HCatalog? If not, what is the best way to use Hive or Pig on my XML dataset?
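For context, the table-creation step in the tutorial assumed plain delimited text, something roughly like the statement below (the table and column names here are placeholders from my notes, not my real schema), which obviously doesn't map onto an XML file:

```sql
-- Roughly what the tutorial had me run (via hcat -e / the Hive shell)
-- for a comma-delimited input file; placeholder columns, not my actual schema.
CREATE TABLE posts_delimited (
  id    INT,
  title STRING,
  body  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```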
EDIT: My file is from the public StackOverflow dataset; I am using the posts.xml file. It's quite large (25 GB) and I'm having trouble opening it on my machine, but below is the structure according to the Readme file (with a rough sketch of the target table schema after the list):
- **posts**.xml
- Id
- PostTypeId
  - 1: Question
  - 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName (e.g. "Jeff Atwood")
- LastEditDate (e.g. "2009-03-05T22:28:34.823")
- LastActivityDate (e.g. "2009-03-11T12:51:01.480")
- CommunityOwnedDate (e.g. "2009-03-11T12:51:01.480")
- ClosedDate (e.g. "2009-03-11T12:51:01.480")
- Title
- Tags
- AnswerCount
- CommentCount
- FavoriteCount
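For what it's worth, the table I'd like to end up with in Hive/HCatalog would look something like the sketch below. The types are my best guesses from the readme (the date fields could presumably be TIMESTAMP rather than STRING), since I haven't been able to open the file to check:

```sql
-- Rough target schema based on the readme fields above; types are guesses.
CREATE TABLE posts (
  id                    BIGINT,
  posttypeid            INT,
  parentid              BIGINT,
  acceptedanswerid      BIGINT,
  creationdate          STRING,
  score                 INT,
  viewcount             INT,
  body                  STRING,
  owneruserid           BIGINT,
  lasteditoruserid      BIGINT,
  lasteditordisplayname STRING,
  lasteditdate          STRING,
  lastactivitydate      STRING,
  communityowneddate    STRING,
  closeddate            STRING,
  title                 STRING,
  tags                  STRING,
  answercount           INT,
  commentcount          INT,
  favoritecount         INT
);
```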
Will the sheer size of this file be a problem in the VM? In the end we will be repeating some of these ETL tasks in AWS, but for now I am trying to avoid racking up a large bill while I am still learning to use the tools properly.