A bit of background: I'm learning Hadoop. I have implemented a machine learning algorithm (clustering) on top of Hadoop and so far tested it only on a small example (30MB).

A couple of days ago I installed Ambari and created a small cluster of four machines (one master and three workers). The master runs the ResourceManager and the NameNode.

Now I'm testing my algorithm on increasing amounts of data (300MB, 3GB). I'm looking for pointers on how to tune my mini-cluster. Concretely, I would like to know how to determine the MapReduce2 and YARN settings in Ambari.

How do I determine the min/max container memory, the reserved memory for containers, the Sort Allocation Memory, the map memory and the reduce memory?

The problem is that my jobs run very slowly on Hadoop (and clustering is an iterative algorithm, which makes things worse).

I have a feeling that my cluster setup is not good, for the following reasons:

  • I ran a job on a 30MB dataset (I set the block size for this job to 8MB, since the data is small and the processing is intensive): execution time 30 minutes
  • I ran the same job on the same dataset replicated 10 times, 300MB (same block size, 8MB): execution time 2 hours
  • The same 300MB of data, but with a 128MB block size: the same execution time, maybe even a bit more than 2 hours
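
To make the three runs easier to compare, here is a rough sketch of how many map tasks each one would get. It assumes the input is a single splittable file and that MapReduce creates roughly one input split (and so one map task) per block; the function name `estimated_map_tasks` is just an illustration, and the real split logic has extra details (per-file splitting, a small tolerance on the last split) that are ignored here.

```python
import math


def estimated_map_tasks(input_mb: float, split_mb: float) -> int:
    """Approximate number of input splits (one map task each).

    Assumption: a single splittable input file, one split per block.
    """
    return math.ceil(input_mb / split_mb)


# (input size in MB, block/split size in MB) for the three runs above
runs = [(30, 8), (300, 8), (300, 128)]

for input_mb, split_mb in runs:
    tasks = estimated_map_tasks(input_mb, split_mb)
    print(f"{input_mb}MB input, {split_mb}MB blocks: about {tasks} map tasks")
```

Under these assumptions the 128MB run gets only about 3 map tasks for 300MB, compared with about 38 for the 8MB run, so for a CPU-intensive job the larger block size may reduce parallelism rather than help.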

The HDFS block size is 128MB, so I thought this would give a speedup, but that is not the case. I suspect that the cluster setup (min/max RAM size, map and reduce RAM) is not good, so performance cannot improve even though greater data locality is achieved.

Could this be the consequence of a bad setup, or am I wrong?

1 Answer

Set the properties below in the YARN configuration to allocate 33% of the maximum YARN memory per job; you can adjust the value to suit your requirements.

yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.default.user-limit-factor=0.33

If you need further information on this, please refer to the following link: https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/
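
As a rough illustration of what this setting does, the sketch below treats `user-limit-factor` as a multiplier on the queue's configured capacity, which caps how much a single user can take. The cluster size (three workers with 8192MB of YARN memory each), the 100% capacity for the `default` queue and the helper name `max_memory_per_user_mb` are all assumptions made up for the example; the real scheduler also rounds to container sizes and applies other limits that are not modelled here.

```python
def max_memory_per_user_mb(cluster_yarn_memory_mb: float,
                           queue_capacity_percent: float,
                           user_limit_factor: float) -> float:
    """Simplified cap on the memory one user can hold in a queue.

    Assumption: cap = queue's configured capacity * user-limit-factor.
    """
    queue_capacity_mb = cluster_yarn_memory_mb * queue_capacity_percent / 100.0
    return queue_capacity_mb * user_limit_factor


# Hypothetical cluster: 3 workers, 8192MB of YARN memory each
cluster_mb = 3 * 8192

for factor in (1, 0.33):
    cap = max_memory_per_user_mb(cluster_mb, 100, factor)
    print(f"user-limit-factor={factor}: up to about {cap:.0f}MB per user")
```

With these made-up numbers, a factor of 1 lets one user fill the whole queue, while 0.33 limits a single user to roughly a third of it.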