0
votes

I'm running some tests with M/R jobs on a 2-node Hadoop 2.2.0 cluster. One thing I would like to understand is the performance trade-off between running a job in local mode (not managed by the ResourceManager) and running it on YARN. In my tests the job runs much faster when it is executed via LocalJobRunner than when it is managed by YARN. When I set up the cluster I followed the steps described here http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ ; perhaps there is some configuration the guide forgot to mention?

Thanks!


2 Answers

0
votes

You'd run LocalJobRunner for tests and small examples. You'd use the cluster when you need to process amounts of data that justify using Hadoop in the first place (a.k.a. "Big data").

When you run a small example, the fixed overhead of running things distributed (scheduling, starting containers, moving data between nodes) overwhelms the benefits of parallelization.
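To make that trade-off concrete, here is a rough sketch of a toy cost model in Java. It assumes a job with a fixed amount of work that splits evenly across nodes, plus a fixed per-job startup overhead on the cluster. The class name `OverheadModel`, its methods, and the 30-second overhead figure are all hypothetical and not measured on any real cluster; real jobs also pay for shuffle, data locality, and stragglers, which this ignores.

```java
// Toy model only: all names and numbers here are illustrative assumptions,
// not Hadoop APIs and not measurements.
public class OverheadModel {

    // Local mode: everything runs in one process, no distribution overhead.
    static double localSeconds(double workSeconds) {
        return workSeconds;
    }

    // Cluster mode: pay a fixed startup overhead, then split the work evenly.
    static double clusterSeconds(double workSeconds, int nodes, double overheadSeconds) {
        if (nodes < 1) {
            throw new IllegalArgumentException("nodes must be >= 1");
        }
        return overheadSeconds + workSeconds / nodes;
    }

    // Work size at which both modes take the same time:
    // overhead + W / n = W  =>  W = overhead * n / (n - 1)
    static double breakEvenWorkSeconds(int nodes, double overheadSeconds) {
        if (nodes < 2) {
            throw new IllegalArgumentException("need at least 2 nodes to break even");
        }
        return overheadSeconds * nodes / (nodes - 1);
    }

    public static void main(String[] args) {
        int nodes = 2;            // cluster size from the question
        double overhead = 30.0;   // hypothetical fixed cost per job, in seconds

        for (double work : new double[] {10.0, 600.0}) {
            System.out.printf("work=%.0fs local=%.0fs cluster=%.0fs%n",
                    work, localSeconds(work), clusterSeconds(work, nodes, overhead));
        }
        System.out.printf("break-even work: %.0fs%n",
                breakEvenWorkSeconds(nodes, overhead));
    }
}
```

Under these assumptions, any job with less work than the break-even point finishes sooner in local mode, which would match what the question observes on a 2-node cluster with small test inputs.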

0
votes

Arnon is right. In one of my use cases I found that running with LocalJobRunner is much faster than running on YARN. LocalJobRunner runs the map tasks in-process on the local machine. The job is not submitted to the cluster, so map tasks are not scheduled across multiple machines. So LocalJobRunner should be used for unit testing the code, and that's it. For all other practical purposes, use YARN.