
I have xml database with data like this:

<events>
      <event id="123">
            <location>ABC</location>
            <subsystem>Storage</subsystem>
            <warning>
                <date>2014-04-01</date>
                <text>warning1</text>
            </warning>
            <warning>
                <date>2014-04-02</date>
                <text>warning2</text>
            </warning>
            <warning>
                <date>2014-04-03</date>
                <text>warning3</text>
            </warning>
      </event>
      ....
</events>

The amount of data is growing, so I would like to switch to processing it with Hadoop. Let's say that for each event I would like to add one extra node, <level>......</level>, based on the <warning> nodes. This raises several problems:

  • How can structured data be stored in Hadoop? I could keep it in XML, but I don't see any tool with native XML/JSON support (Pig supports JSON, but without lists). I could split it by columns into different files (one for events and one for warnings, then join them by event id), but there are a lot of subnodes (this is only part of the original format), so joining all of them every time would be problematic.

  • The new column (level) could be stored in newly generated XML files together with the current data, or in a separate file as a mapping from event_id to level. Storing everything in new XML files means regenerating the XML each time, while keeping the level in a separate file means a join every time I need to access it. Is there something in between (just updating a row in some format)?

  • It would be great to be able to easily add new nodes/columns for just a few rows (for example, when the level is critical I would like to add an extra note). That is easy with XML, but with columns it means adding the new column to all rows.

  • Most tools support only a flat structure. There are tools like Hive with HQL, but in my case that would mean too many joins, so I would prefer to keep the data in a single structured record. Are there existing solutions to this problem?


2 Answers


Look at using Avro (http://avro.apache.org) or Google Protocol Buffers (https://code.google.com/p/protobuf/) as the storage format for your data instead of XML, and use the Avro SerDe to interpret the records in a Hive table.
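For illustration only, a minimal Hive DDL for a table backed by Avro files could look roughly like this (the table name, location and schema URL are placeholders, and the exact syntax depends on your Hive version):

CREATE EXTERNAL TABLE events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/events'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/event.avsc');

Hive derives the columns from the Avro schema, so the nested warnings come through as a single array-of-structs column on each event row rather than as a separate table that has to be joined back in.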

Avro supports versioning, so you could have different records with different sets of columns, depending on the version of the underlying data and the version of the schema used in the table definition. Avro should also support your requirement for arbitrarily nested and complex structures.
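As a rough sketch (the record and field names below simply mirror your XML and are only one way you might model it), an Avro schema could represent the warnings as a nested array and the optional level/note as nullable fields with defaults:

{
  "type": "record",
  "name": "Event",
  "namespace": "com.example.events",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "location", "type": "string"},
    {"name": "subsystem", "type": "string"},
    {"name": "warnings", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Warning",
        "fields": [
          {"name": "date", "type": "string"},
          {"name": "text", "type": "string"}
        ]
      }
    }},
    {"name": "level", "type": ["null", "string"], "default": null},
    {"name": "note", "type": ["null", "string"], "default": null}
  ]
}

Because level and note are nullable with defaults, records written before those fields existed can still be read with the newer schema, which covers your case of adding a column for only a few rows without rewriting everything.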