2 votes

I have a scenario where I process thousands of small files using Hadoop. The output of the Hadoop job is then used as input for a non-Hadoop algorithm. In the current workflow, data is read, converted to SequenceFiles, processed, and the resulting small files are written to HDFS as SequenceFiles. However, the non-Hadoop algorithm cannot read SequenceFiles. Therefore, I've written another simple Hadoop job that reads the records back out of the SequenceFiles and creates the final small files that the non-Hadoop algorithm can use.
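For illustration, a minimal sketch of that conversion-back step (the key/value types, class name, and paths here are assumptions, not the actual job) could look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]); // SequenceFile produced by the first job

            // Iterate over the records so each one can be written back out as a small file.
            try (SequenceFile.Reader reader = new SequenceFile.Reader(
                    conf, SequenceFile.Reader.file(input))) {
                Text key = new Text();                     // assumed key: original file name
                BytesWritable value = new BytesWritable(); // assumed value: file contents
                while (reader.next(key, value)) {
                    System.out.printf("%s -> %d bytes%n", key, value.getLength());
                }
            }
        }
    }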

The catch here is that for this final job I have to read the SequenceFiles from HDFS and write the output to the local file system of each node, so the non-Hadoop algorithm can process it there. I've tried setting the output path to file:///<local-fs-path> and using Hadoop's LocalFileSystem class, but doing so writes the final results only to the namenode's local file system.
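A rough sketch of what that attempt looks like in the job driver (paths and the class name are illustrative only, and the mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SeqToLocalDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "seqfile-to-local");
            job.setJarByClass(SeqToLocalDriver.class);

            // Read SequenceFiles from HDFS ...
            FileInputFormat.addInputPath(job, new Path("hdfs:///data/seq-output"));
            // ... and try to write the final files to the local file system.
            FileOutputFormat.setOutputPath(job, new Path("file:///tmp/final-output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }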

Just to complete the picture: I have a 10-node Hadoop setup running YARN. Is there a way, in YARN mode, to read data from HDFS and write the results to the local file system of each processing node?

Thanks

You could mount an NFS drive... I don't see the benefit of writing to the local datanodes if you just need to collect all the results anyway. Also, Hadoop doesn't perform well with thousands of tiny files, so are you sure you're using the right approach? - OneCricketeer
Unfortunately, the project requirements are as stated. Processing with Hadoop actually saved us more than 20 hours of work despite the large number of files, so I'd say we are good with Hadoop. Thanks for suggesting NFS, though; we already considered that. - F Baig

1 Answer

0 votes

Not really. While you can write to LocalFileSystem, you can't ask YARN to run your application on every node. Also, depending on how your cluster is configured, YARN NodeManagers might not be running on every node in your system.

A possible workaround is to keep your converted files in HDFS and then have your non-Hadoop process first call hdfs dfs -copyToLocal to pull them onto the local file system.
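If that copy step needs to be scripted rather than run by hand, the same thing can be done through the FileSystem API (a sketch only; the paths and class name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PullResultsLocal {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
            try (FileSystem hdfs = FileSystem.get(conf)) {
                Path hdfsOutput = new Path("/user/app/final-output");      // results in HDFS
                Path localTarget = new Path("file:///data/non-hadoop/in"); // local destination
                // Equivalent of `hdfs dfs -copyToLocal`; false = keep the HDFS copy.
                hdfs.copyToLocalFile(false, hdfsOutput, localTarget);
            }
        }
    }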