
Is it possible to reuse HDFS storage for two or more Hadoop installations? Or, in other words, to replicate NameNode state?

I want to build a small showcase Hadoop cluster (3-5 nodes) and I'd like to be able to play around with several Hadoop distributions (Hortonworks and Cloudera at least). I have not yet decided how to install them side by side, and that also seems to be a challenge, but for now I'd like to know: is it possible to reuse the data stored in HDFS across different clusters (physically using the same hard disks)?

For simplicity, I'll be happy if it works for any combination of Hadoop distros and I'm ready to lose my data at some point, because it's just an experiment.

UPDATE: I want to use HDFS exclusively with one chosen Hadoop installation at a time. Let's say one day I use Cloudera and the next day Hortonworks, but both use the same data in HDFS.

this sounds like more trouble than it is worth – Donald Miner

I agree, that's why I asked it on SO. If there is a reasonable solution, I hope somebody will give me a hint; otherwise it will stay without an answer. – Viacheslav Rodionov

1 Answer


One caveat: the NameNodes would need to run on separate machines, since you would not be able to bind multiple NameNodes to the same port (8020 by default).

Having said that, Cloudera and Hortonworks both use the same Hadoop binaries and the same configuration options as you would if you built it all yourself. The difference is in their management consoles, which do not come with the base open-source Hadoop releases. My suggestion is to look into configuring a single Hadoop group and user base that all have access to the same HDFS NameNode/DataNodes, JobTracker, etc. You should then be able to point both installations at the same HDFS file system. You will also have to set up each user's SSH permissions.
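For example, a minimal core-site.xml sketch that both installations could share; the host name nn-host is just a placeholder for wherever your single NameNode runs (older releases use fs.default.name instead of fs.defaultFS):

    <!-- core-site.xml: point every installation at the same NameNode -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://nn-host:8020</value>
      </property>
    </configuration>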

There are some limitations, though, such as HDFS supporting exclusive writes only. When the first client contacts the NameNode to open a file for writing, the NameNode grants the client a lease to create that file. When a second client tries to open the same file for writing, the NameNode sees that the lease for the file has already been granted to another client and rejects the second client's open request.
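To make that concrete, here is a minimal Java sketch against the standard FileSystem API. The NameNode address hdfs://nn-host:8020 is a placeholder and the exact exception varies by version; the point is simply that a second create() on a file that is still open for writing gets rejected:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ExclusiveWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://nn-host:8020"); // placeholder NameNode address

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/lease-demo.txt");

            // First writer: the NameNode grants this client the lease on the file.
            FSDataOutputStream first = fs.create(file, true);
            first.writeBytes("first writer holds the lease\n");

            // Second open-for-write on the same path while the lease is still held:
            // the NameNode rejects it (typically AlreadyBeingCreatedException).
            try {
                fs.create(file, false).close();
            } catch (IOException expected) {
                System.out.println("Second writer rejected: " + expected.getMessage());
            }

            first.close(); // lease released; the file can be opened for writing again
            fs.close();
        }
    }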

I would also configure HDFS directories accordingly, in order to preserve some level of organization.
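As a rough sketch (the directory and user names are just examples), the standard hdfs dfs commands are enough for that:

    hdfs dfs -mkdir -p /user/cloudera-exp /user/hdp-exp   # one area per distro/user
    hdfs dfs -chown cloudera-exp /user/cloudera-exp
    hdfs dfs -chown hdp-exp /user/hdp-exp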

I did just this with Hadoop 0.23 and 2.2.0 in VMware/Ubuntu.

Lastly, take a look here at the official Hadoop wiki and FAQ.

Good luck, Pat