
I have a Hadoop cluster with 20 nodes, 15 of which have a file (on their local file system) with the same name. What is the best way to read all 15 of these files in a MapReduce program?

One way to do this is to manually run the 'hadoop fs -put..' command on each of these 15 nodes to copy the file to HDFS, each under a different HDFS name, and then read the files from HDFS in the MapReduce program. I am wondering, though, whether there is a better alternative that avoids this manual transfer.

Thanks!!


1 Answer


Taking a step back: how will a given mapper know which local filesystem pathname to use, given that 5 of the 20 nodes do not have the file? Will it resort to trial and error?

Typically you try to avoid variations between mappers in terms of local environment / local filesystem settings. If you need to pick up node-specific files, then yes, it may make sense to include a pre-processing step that uploads the files from the individual mapper machines to an HDFS directory, perhaps including the local hostname in the new path. Could you say a bit more about the impetus for this somewhat non-standard setup?

UPDATE based on OP clarification.

Add code within the mapper that

(a) checks whether the file exists on the local filesystem (using an ordinary java.io.File),
(b) if it is present, reads it in with a java.io.FileInputStream, and
(c) then uses the **HDFS** API to create a new HDFS file and write the data to it:

out = fs.create(new Path(uri));

So you will be reading from the local FS and writing to HDFS. When you write to HDFS, consider adding the local machine's hostname to the filename so as to differentiate among the 15 machines.
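If it helps, here is a minimal sketch of those three steps as a standalone helper you could call from the mapper's setup(); the class name and staging directory are my own placeholders, not anything from your job:

```java
// Rough sketch (not the OP's code): copy a node-local file into HDFS from
// within the mapper, tagging the HDFS filename with the local hostname.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class LocalFileUploader {

    /** Copies localPath to hdfsDir/<hostname>-<filename> if the local file exists. */
    public static void uploadIfPresent(Configuration conf, String localPath, String hdfsDir)
            throws IOException {
        File local = new File(localPath);
        if (!local.exists()) {                       // (a) check existence on the local FS
            return;                                  // this node is one of the 5 without the file
        }
        String host = InetAddress.getLocalHost().getHostName();
        Path target = new Path(hdfsDir, host + "-" + local.getName());

        FileSystem fs = FileSystem.get(conf);        // HDFS handle from the job configuration
        try (FileInputStream in = new FileInputStream(local);        // (b) read the local file
             FSDataOutputStream out = fs.create(target, false)) {    // (c) create and write on HDFS
            IOUtils.copyBytes(in, out, conf, false);
        }
    }
}
```

Passing false to fs.create refuses to overwrite, so a re-run fails fast instead of silently clobbering an earlier upload.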

Another update: The OP continues to add new requirements. To handle the case of multiple mappers on the same machine, create a Hadoop Counter named after the machine's un-dotted IP address: each mapper checks whether it has been set, and if not, (a) sets it and (b) does the work.
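A rough sketch of that counter guard as described, with a placeholder counter group name ("UPLOAD_GUARD") and the un-dotted IP derived via InetAddress:

```java
// Rough sketch of the per-machine counter guard described above; the group
// name and the one-time work are placeholders.
import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.mapreduce.Mapper;

public class GuardedMapper extends Mapper<Object, Object, Object, Object> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String unDottedIp = InetAddress.getLocalHost().getHostAddress().replace(".", "");
        // Counter named after this machine; zero means the upload has not been claimed yet.
        if (context.getCounter("UPLOAD_GUARD", unDottedIp).getValue() == 0) {
            context.getCounter("UPLOAD_GUARD", unDottedIp).increment(1);   // (a) set it
            // (b) do the one-time work, e.g. the local-to-HDFS copy shown earlier
        }
    }
}
```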

As for the NEW large-file / parallelism requirement: that is a new requirement that cannot be met here. Please consider accepting this answer as having answered the original question. A separate discussion can be had about the new one you are posing.

Third update: How to handle uploading large local files to HDFS: I do not know of any easy way to do this. The reason HDFS can load / process / store large files in parallel is that they are broken into blocks. Local files are not splittable by the local file system.

That being said, you could split the files manually and upload the chunks in parallel via separate threads. Each thread would need to 'register' which offset into the file it is loading. There are, however, glaring issues here. (A) I suspect this would actually decrease speed, since the disk seeks would no longer be sequential. (B) How do you plan to save the chunks to HDFS - and then reconstitute them as a single file?
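For what it is worth, here is a rough sketch of that chunked, multi-threaded upload; the part-NNNNN naming is my own placeholder for 'registering' each chunk's offset, and issues (A) and (B) are deliberately left unresolved:

```java
// Rough sketch only (not a recommendation): split a local file into fixed-size
// chunks and upload each chunk to HDFS from its own thread. Reassembly of the
// chunks into a single HDFS file is not handled here.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkedUploader {

    public static void upload(Configuration conf, String localFile, String hdfsDir,
                              long chunkSize, int threads) throws Exception {
        long fileLength = new java.io.File(localFile).length();
        int chunks = (int) ((fileLength + chunkSize - 1) / chunkSize);
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < chunks; i++) {
            final long offset = i * chunkSize;
            final long length = Math.min(chunkSize, fileLength - offset);
            // The chunk's offset is 'registered' implicitly by its position in the name.
            final Path target = new Path(hdfsDir, String.format("part-%05d", i));
            pool.submit(() -> {
                try (RandomAccessFile raf = new RandomAccessFile(localFile, "r");
                     FSDataOutputStream out = FileSystem.get(conf).create(target, false)) {
                    raf.seek(offset);                       // jump to this chunk's offset
                    byte[] buf = new byte[64 * 1024];
                    long remaining = length;
                    while (remaining > 0) {
                        int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                        if (n < 0) break;
                        out.write(buf, 0, n);
                        remaining -= n;
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```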