Hadoop (HDFS) - file versioning

Question

I currently have a user file system in my application (Apache CMIS). As it grows, I'm considering moving to Hadoop (HDFS), since we also need to run some statistics on it. The problem: the current file system provides file versioning. From what I've read about HDFS and file versioning, I would mostly have to write this versioning layer myself. Is there something already available to manage versioning of files in HDFS, or do I really have to write it myself? I don't want to reinvent the wheel, but I can't find a proper solution either.

Answer

For full details: see comments on answer(s) below

Hadoop (HDFS) doesn't support versioning of files. You can get this functionality by combining Hadoop with Amazon S3: Hadoop uses S3 as the filesystem (without chunks; recovery is provided by S3). This solution comes with the file versioning that S3 provides. Hadoop still uses YARN for the distributed processing.
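
To make the "S3 as the filesystem" part concrete, here is a minimal sketch, not a definitive setup. It assumes the S3A connector (the `s3a://` scheme from the hadoop-aws module) and generates a `core-site.xml` that points `fs.defaultFS` at a bucket. The bucket name and the helper `build_core_site` are hypothetical. Versioning itself has to be enabled on the bucket on the S3 side; this configuration does not turn it on.

```python
import xml.etree.ElementTree as ET


def build_core_site(bucket, access_key=None, secret_key=None):
    """Return core-site.xml text that makes s3a://<bucket> the default filesystem.

    `bucket` is a hypothetical bucket name. Credentials are optional here;
    they can also be supplied by other means (environment, instance roles).
    """
    props = {"fs.defaultFS": f"s3a://{bucket}"}
    if access_key and secret_key:
        props["fs.s3a.access.key"] = access_key
        props["fs.s3a.secret.key"] = secret_key

    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical bucket name, for illustration only.
    print(build_core_site("my-versioned-bucket"))
```

With a configuration like this, jobs scheduled through YARN read and write `s3a://` paths instead of HDFS paths, and older versions of an object are kept by S3 once bucket versioning is enabled.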

franklinsijo · Accepted Answer · 2017-03-13T13:17:52

Versioning is not possible with HDFS.
Instead, you can use Amazon S3, which provides versioning and is also compatible with Hadoop.
