I wonder how to version data in Hadoop/HDFS/HBase. Versioning should be part of the data model, as format changes are very likely (big data is collected over a long time).
Main example for HDFS (file-based backend):
sample-log-file.log:
timestamp x1 y1 z1 ...
timestamp x2 y2 z2 ...
I now wonder where to add the versioning info. I see 2 alternatives:
1. Version inside the file format
log-file.log:
timestamp V1 x1 y1 z1 ...
timestamp V2 w1 x2 y2 z1 ...
2. Version inside the file name
*log-file_V1.log*
timestamp x1 y1 z1 ...
*log-file_V2.log*
timestamp w1 x1 y1 z1 ...
The 2nd option (version in the file name) feels a bit cleaner to me and fits HDFS well (I can simply use *_V2* as a pattern to exclude files in the old version format). On the other hand, I would then need to run 2 different jobs, as I cannot evaluate the version information within one single job.
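To make the two alternatives concrete, here is a minimal sketch in plain Python (not Hadoop API code). It assumes whitespace-separated fields and that V2 only adds a leading field `w`, as in the sample lines above. The names `LAYOUTS`, `parse_inline` and `parse_with_filename` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical field layouts per version (the fields after the timestamp).
# Assumption: V2 adds a leading field "w", as in the sample lines.
LAYOUTS = {
    "V1": ["x", "y", "z"],
    "V2": ["w", "x", "y", "z"],
}

# Matches the version suffix in names such as log-file_V2.log
VERSION_IN_NAME = re.compile(r"_(V\d+)\.log$", re.IGNORECASE)


def parse_inline(line):
    """Alternative 1: the version token follows the timestamp in every line."""
    timestamp, version, *values = line.split()
    record = dict(zip(LAYOUTS[version], values))
    record.update(timestamp=timestamp, version=version)
    return record


def parse_with_filename(file_name, line):
    """Alternative 2: the version is taken from the file name."""
    match = VERSION_IN_NAME.search(file_name)
    if match is None:
        raise ValueError(f"no version in file name: {file_name}")
    version = match.group(1).upper()
    timestamp, *values = line.split()
    record = dict(zip(LAYOUTS[version], values))
    record.update(timestamp=timestamp, version=version)
    return record


if __name__ == "__main__":
    print(parse_inline("1700000000 V2 w1 x2 y2 z2"))
    print(parse_with_filename("log-file_V1.log", "1700000000 x1 y1 z1"))
```

With alternative 1 each line is self-describing, so one pass can handle mixed versions. With alternative 2 the parser needs the file name next to each line; whether that name is available inside a single job is exactly the part I am unsure about.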
How about HBase? I guess in HBase the version would definitely end up in another table column (HDFS is an implementation detail there, used only as the backend for HBase)?
Are there any other approaches to versioning data for Hadoop/HDFS/HBase backends?
Thanks!
EDIT: My question is about how to handle the version information itself, not the timestamp.