At the moment I have a user file system in my application (Apache CMIS). As it is growing bigger, I'm considering moving to Hadoop (HDFS), since we also need to run some statistics on it. The problem: the current file system provides versioning of the files. When I read about Hadoop/HDFS and file versioning, I found in most cases that I would have to write this versioning layer myself. Is there already something available to manage versioning of files in HDFS, or do I really have to write it myself? (I don't want to reinvent the wheel, but I can't find a proper solution either.)
Answer
Hadoop (HDFS) doesn't support versioning of files. You can get this functionality by combining Hadoop with Amazon S3: Hadoop uses S3 as the file system (without HDFS-style block chunking; durability and recovery are handled by S3 instead), and S3's object versioning gives you the file versioning. Versioning itself is enabled on the S3 bucket, not through Hadoop. Hadoop will still use YARN for the distributed processing.
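As a rough illustration, here is a minimal Java sketch of writing through Hadoop's s3a connector; the bucket name, credentials and path are placeholders, and it assumes the hadoop-aws module is on the classpath and that versioning has already been enabled on the bucket itself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    public class S3aVersioningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credentials for the s3a connector (placeholders).
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            // "my-versioned-bucket" is a hypothetical bucket with S3 versioning
            // enabled; versioning is configured on the bucket, not through Hadoop.
            FileSystem fs = FileSystem.get(URI.create("s3a://my-versioned-bucket/"), conf);

            Path file = new Path("s3a://my-versioned-bucket/docs/report.txt");
            try (FSDataOutputStream out = fs.create(file, true)) { // overwrite = true
                out.write("second revision".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }

Every overwrite then becomes a new object version in S3, which you can list or restore with the regular S3 tooling; Hadoop jobs themselves only ever see the latest version.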