
I wonder how to version data in Hadoop/HDFS/HBase. Versioning should be part of your data model, since format changes are very likely (big data is collected over a long time).

Main example for HDFS (file-based backend).

sample-log-file.log:

timestamp x1 y1 z1 ...
timestamp x2 y2 z2 ...

I now wonder where to put the versioning info. I see two alternatives:

Version inside file-format

log-file.log:


timestamp V1 x1 y1 z1 ...
timestamp V2 w1 x2 y2 z1 ...

Version inside file-name

*log-file_V1.log*


timestamp x1 y1 z1 ...

*log-file_V2.log*

timestamp w1 x1 y1 z1 ...

The 2nd option (version in the file name) feels a bit cleaner to me and fits HDFS well (I can simply use *_v2* as a pattern to exclude files with the old version). On the other hand, I would then need to run two different jobs, as I cannot inspect the version token within one single job.
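To illustrate why the first option (version inside the file format) can still be handled by a single job, here is a minimal sketch of a parser that dispatches on the version token of each line. The field layouts (V1: x y z; V2: w x y z) follow the example above; the function name and the dict layout are my own assumptions, not an established API.

```python
# Hypothetical sketch: one parser handles both record versions that may
# appear in the same file, so a single job can read mixed-version logs.

def parse_line(line):
    """Parse a log line whose second token is the format version."""
    tokens = line.split()
    timestamp, version = tokens[0], tokens[1]
    if version == "V1":
        x, y, z = tokens[2:5]
        return {"ts": timestamp, "w": None, "x": x, "y": y, "z": z}
    elif version == "V2":
        w, x, y, z = tokens[2:6]
        return {"ts": timestamp, "w": w, "x": x, "y": y, "z": z}
    else:
        raise ValueError("unknown record version: %s" % version)

print(parse_line("1388534400 V1 1.0 2.0 3.0"))
print(parse_line("1388534401 V2 0.5 1.0 2.0 3.0"))
```

A mapper built this way normalizes old and new records into one schema, at the cost of carrying the version token on every line.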

How about HBase? I guess in HBase the version would definitely end up in a separate table column (HDFS is an implementation detail there, used as the backend for HBase)?

Is there any other approach to versioning data for the Hadoop/HDFS/HBase backends?

Thanks!

EDIT: my question is about how to handle the version information itself, not the timestamp.


2 Answers

0 votes

In my view, efficient data versioning requires storing records of the same version in close proximity. You can then use application logic to select the right version for your needs. This is similar to what some relational databases do.
This approach might be what CouchDB uses, although I am not 100% sure about it.
Now let's look at HDFS/HBase. They are quite different from this perspective, since HBase allows data mutation and editing, while HDFS does not.
So for HBase you can make the timestamp the last part of the row key, and all versions will be stored together.
HDFS is suited for storing a small number of big files, and we cannot edit them. I would suggest writing all versions to the files in the order they arrive and using MapReduce to group all versions of a record (with their different timestamps) together in the reducer. It will not be efficient, since all the data has to pass through the shuffle, but you stay in control. To mitigate the cost, you can run this resolution periodically and store the data consolidated into a single current version.
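The reduce-side resolution described above can be sketched in plain Python: sort by record key (the shuffle), then per key keep only the latest-timestamped version (the reducer). The record layout `(key, timestamp, payload)` is a hypothetical example, not a fixed format.

```python
# Sketch of reduce-side version resolution: group all occurrences of a
# record key, then keep only the value with the latest timestamp.
from itertools import groupby
from operator import itemgetter

records = [
    ("sensor-a", 100, "v1-payload"),
    ("sensor-b", 105, "old"),
    ("sensor-a", 200, "v2-payload"),
    ("sensor-b", 300, "new"),
]

# "Shuffle" phase: sort by key so each key's records arrive together.
records.sort(key=itemgetter(0))

# "Reduce" phase: per key, resolve to the record with the latest timestamp.
resolved = {
    key: max(group, key=itemgetter(1))[2]
    for key, group in groupby(records, key=itemgetter(0))
}
print(resolved)
```

In a real MapReduce job the framework performs the sort/group step; the reducer only needs the `max`-by-timestamp logic.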

2 votes

For HDFS, storing the timestamps inside the file uses a lot more space (the timestamp is repeated on every line) but gives you the flexibility to hold multiple dates in a single file. Which is preferable depends entirely on your use case.

For HBase, you have a couple of options: you can explicitly include a timestamp (and/or version number) in the row key, making different versions of a data item different rows in the table; or you can use HBase's built-in time dimension, which actually stamps every cell in the database (i.e. every value in every column in every row) with a timestamp and lets you keep a configurable number of versions around. By default, scans return only the most recent version of each key/value, but you can change that behavior at scan time to return multiple versions, or only versions in a given time range.
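To make the second option concrete, here is a small simulation (plain Python, deliberately not the HBase client API) of how the built-in cell versioning behaves: each cell keeps up to a configured number of timestamped values, a default scan returns only the newest one, and a time-range scan can return older versions. All names here are illustrative.

```python
# Toy model of an HBase cell with a configurable number of kept versions.
MAX_VERSIONS = 3

cell = []  # list of (timestamp, value), kept newest-first

def put(ts, value):
    """Write a value; evict versions beyond the configured limit."""
    cell.append((ts, value))
    cell.sort(reverse=True)   # newest version first
    del cell[MAX_VERSIONS:]   # keep at most MAX_VERSIONS entries

def scan_latest():
    """Default scan behavior: only the most recent version."""
    return cell[0][1]

def scan_range(start, stop):
    """Time-range scan: all surviving versions with start <= ts < stop."""
    return [v for ts, v in cell if start <= ts < stop]

put(100, "a"); put(200, "b"); put(300, "c"); put(400, "d")
print(scan_latest())         # only the newest surviving version
print(scan_range(200, 400))  # older versions within a time range
```

Note that after the fourth put, the version written at timestamp 100 has been evicted, which mirrors how HBase discards versions beyond the column family's configured maximum.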