3
votes

I want to run some executables outside of Hadoop (but on the same cluster) using input files that are stored in HDFS.

Do these files need to be copied locally to the node, or is there a way to access HDFS from outside of Hadoop?

Any other suggestions on how to do this are welcome. Unfortunately, my executables cannot be run within Hadoop.

Thanks!


4 Answers

5
votes

There are a couple of typical ways:

  • You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for FileSystem.open(), which gives you a stream that acts like a generic open file (see the sketch after this list).
  • You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically build a bridge around this command-line invocation with something like popen.
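
If the Java API route fits, a minimal sketch of reading a file with FileSystem.open() might look like the following. The namenode URI and file path are placeholders, and the old-style fs.default.name key is assumed given the Hadoop 0.20/1.0 era of this question; adjust for your cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Point the client at the cluster; the namenode URI is a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);

            // open() returns a stream you can read like any other InputStream.
            FSDataInputStream in = fs.open(new Path("/path/to/file/part-r-00000"));
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            fs.close();
        }
    }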
3
votes

Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 0.23.1 release as well. Since it is based on a REST API, any language can access it, and Hadoop does not need to be installed on the node that needs the HDFS files. It is also just as fast as the other options mentioned by orangeoctopus.
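
As a rough illustration, reading a file over WebHDFS with op=OPEN might look like this from Java (any HTTP client would do). The host, port, path, and user name are placeholders; the namenode answers the OPEN call with a redirect to a datanode, which this sketch follows by hand.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsRead {
        public static void main(String[] args) throws Exception {
            // Hypothetical namenode host, port, path, and user; adjust for your cluster.
            URL nnUrl = new URL("http://namenode:50070/webhdfs/v1/path/to/file/part-r-00000"
                    + "?op=OPEN&user.name=someuser");

            HttpURLConnection conn = (HttpURLConnection) nnUrl.openConnection();
            conn.setInstanceFollowRedirects(false);  // handle the redirect ourselves

            // The namenode redirects to the datanode that actually serves the data.
            String dataNodeLocation = conn.getHeaderField("Location");
            conn.disconnect();

            HttpURLConnection dataConn =
                    (HttpURLConnection) new URL(dataNodeLocation).openConnection();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(dataConn.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
            dataConn.disconnect();
        }
    }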

0
votes

The best way is to install the "hadoop-0.20-native" package on the box where you are running your code. The hadoop-0.20-native package can access the HDFS filesystem and can act as an HDFS proxy.

0
votes

I had a similar issue and asked a related question. I needed to access HDFS / MapReduce services from outside the cluster. After I found a solution, I posted an answer here for HDFS. The most painful issue turned out to be user authentication, which in my case was solved in the simplest way (the complete code is in my question).

If you need to minimize dependencies and don't want to install Hadoop on the clients, there is a nice Cloudera article on how to configure Maven to build a JAR for this. It worked 100% for my case.

The main difference between remote MapReduce job submission and plain HDFS access is a single configuration setting (check the mapred.job.tracker variable), as sketched below.
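
For illustration, a bare-bones remote client configuration under those assumptions (old-style property names, placeholder hostnames and ports) might look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteClusterClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Point the client at the remote cluster; hostnames and ports are placeholders.
            conf.set("fs.default.name", "hdfs://namenode:8020");

            // Only needed when submitting MapReduce jobs, not for plain HDFS access.
            conf.set("mapred.job.tracker", "jobtracker:8021");

            // Simple sanity check: list the HDFS root directory.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }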