
I wonder how to version data in Hadoop/HDFS/HBase. Versioning should be part of your data model, since format changes are very likely (big data is collected over a long time).

Main example for HDFS (file-based backend).

sample-log-file.log:

timestamp x1 y1 z1 ...
timestamp x2 y2 z2 ...

I now wonder where to put the versioning info. I see two alternatives:

Version inside file-format

log-file.log:


timestamp V1 x1 y1 z1 ...
timestamp V2 w1 x2 y2 z1 ...

Version inside file-name

*log-file_V1.log*


timestamp x1 y1 z1 ...

*log-file_V2.log*

timestamp w1 x1 y1 z1 ...

The 2nd option (version in the file name) feels a bit cleaner to me and fits HDFS well (I can simply use *_v2* as a pattern to exclude files with the old version). On the other hand, I would then need to run two different jobs, as I cannot inspect the version token within one single job.
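To illustrate why the first option (version inside the file format) can still be handled by a single job, here is a minimal sketch of a parser that dispatches on the version token of each line. The field layouts (V1: x y z; V2: w x y z) follow the example above; the function name and the dict layout are my own assumptions, not an established API.

```python
# Hypothetical sketch: one parser handles both record versions that may
# appear in the same file, so a single job can read mixed-version logs.

def parse_line(line):
    """Parse a log line whose second token is the format version."""
    tokens = line.split()
    timestamp, version = tokens[0], tokens[1]
    if version == "V1":
        x, y, z = tokens[2:5]
        return {"ts": timestamp, "w": None, "x": x, "y": y, "z": z}
    elif version == "V2":
        w, x, y, z = tokens[2:6]
        return {"ts": timestamp, "w": w, "x": x, "y": y, "z": z}
    else:
        raise ValueError("unknown record version: %s" % version)

print(parse_line("1388534400 V1 1.0 2.0 3.0"))
print(parse_line("1388534401 V2 0.5 1.0 2.0 3.0"))
```

A mapper built this way normalizes old and new records into one schema, at the cost of carrying the version token on every line.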

How about HBase? I guess in HBase the version would definitely end up in a separate table column (HDFS is an implementation detail there, used as the backend for HBase)?

Is there any other approach to versioning data for the Hadoop/HDFS/HBase backends?

Thanks!

EDIT: my question is about how to handle the version information itself, not the timestamp.


2 Answers

0 votes

In my view, efficient data versioning requires storing records of the same version in close proximity. You can then use application logic to select the right version for your needs. This is similar to what some relational databases do.
This approach might be what CouchDB uses, although I am not 100% sure about it.
Now let's look at HDFS/HBase. They are quite different from this perspective, since HBase allows data mutation and editing, while HDFS does not.
So for HBase you can make the timestamp the last part of the row key, and all versions will be stored together.
HDFS is suited for storing a small number of big files, and we cannot edit them. I would suggest writing all versions to the files in the order they arrive and using MapReduce to group all versions of a record (with their different timestamps) together in the reducer. It will not be efficient, since all the data has to pass through the shuffle, but you stay in control. To mitigate the cost, you can run this resolution periodically and store the data consolidated into a single current version.
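The reduce-side resolution described above can be sketched in plain Python: sort by record key (the shuffle), then per key keep only the latest-timestamped version (the reducer). The record layout `(key, timestamp, payload)` is a hypothetical example, not a fixed format.

```python
# Sketch of reduce-side version resolution: group all occurrences of a
# record key, then keep only the value with the latest timestamp.
from itertools import groupby
from operator import itemgetter

records = [
    ("sensor-a", 100, "v1-payload"),
    ("sensor-b", 105, "old"),
    ("sensor-a", 200, "v2-payload"),
    ("sensor-b", 300, "new"),
]

# "Shuffle" phase: sort by key so each key's records arrive together.
records.sort(key=itemgetter(0))

# "Reduce" phase: per key, resolve to the record with the latest timestamp.
resolved = {
    key: max(group, key=itemgetter(1))[2]
    for key, group in groupby(records, key=itemgetter(0))
}
print(resolved)
```

In a real MapReduce job the framework performs the sort/group step; the reducer only needs the `max`-by-timestamp logic.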

2 votes

For HDFS, storing the timestamps inside the file uses a lot more space (the timestamp is repeated on every line) but gives you the flexibility to hold multiple dates in a single file. Which is preferable depends entirely on your use case.

For HBase, you have a couple of options: you can explicitly include a timestamp (and/or version number) in the row key, making different versions of a data item different rows in the table; or you can use HBase's built-in time dimension, which actually stamps every cell in the database (i.e. every value in every column in every row) with a timestamp and lets you keep a configurable number of versions around. By default, scans return only the most recent version of each key/value, but you can change that behavior at scan time to return multiple versions, or only versions in a given time range.
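To make the second option concrete, here is a small simulation (plain Python, deliberately not the HBase client API) of how the built-in cell versioning behaves: each cell keeps up to a configured number of timestamped values, a default scan returns only the newest one, and a time-range scan can return older versions. All names here are illustrative.

```python
# Toy model of an HBase cell with a configurable number of kept versions.
MAX_VERSIONS = 3

cell = []  # list of (timestamp, value), kept newest-first

def put(ts, value):
    """Write a value; evict versions beyond the configured limit."""
    cell.append((ts, value))
    cell.sort(reverse=True)   # newest version first
    del cell[MAX_VERSIONS:]   # keep at most MAX_VERSIONS entries

def scan_latest():
    """Default scan behavior: only the most recent version."""
    return cell[0][1]

def scan_range(start, stop):
    """Time-range scan: all surviving versions with start <= ts < stop."""
    return [v for ts, v in cell if start <= ts < stop]

put(100, "a"); put(200, "b"); put(300, "c"); put(400, "d")
print(scan_latest())         # only the newest surviving version
print(scan_range(200, 400))  # older versions within a time range
```

Note that after the fourth put, the version written at timestamp 100 has been evicted, which mirrors how HBase discards versions beyond the column family's configured maximum.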