6 votes

I have 10 nodes with 32 GB RAM in my Hadoop cluster and one node with 64 GB.

On those 10 nodes the yarn.nodemanager.resource.memory-mb limit is set to 26 GB, and on the 64 GB node it is set to 52 GB (some jobs need 50 GB for a single reducer, and they run on that node).

The problem is that when I run basic jobs that need, say, 8 GB per mapper, the 32 GB nodes spawn 3 mappers in parallel (26 / 8 = 3), while the 64 GB node spawns 6. That node usually finishes last because of the CPU load.

I'd like to limit the job's container resources programmatically, e.g. set a 26 GB limit for most of the jobs. How can this be done?


2 Answers

2 votes

First of all, yarn.nodemanager.resource.memory-mb (memory) and yarn.nodemanager.resource.cpu-vcores (vcores) are NodeManager daemon/service configuration properties and cannot be overridden from YARN client applications. You need to restart the NodeManager service if you change these properties.

Since CPU is the bottleneck in your case, my recommendation is to switch the YARN scheduling strategy to the Fair Scheduler with the DRF (Dominant Resource Fairness) scheduling policy at the cluster level. That gives you the flexibility to specify the application container size in terms of both memory and CPU cores, and the number of application containers (mappers/reducers/AM) running in parallel will also be bounded by the vcores you define.

The scheduling policy can be set at the Fair Scheduler queue/pool level.

schedulingPolicy: sets the scheduling policy of a queue. The allowed values are "fifo", "fair", and "drf".

See the Apache Fair Scheduler documentation for more details.

Once you have created a new Fair Scheduler queue/pool with the DRF scheduling policy, both memory and CPU cores can be set in the program as follows.


How to define the container size in a MapReduce application:

Configuration conf = new Configuration();

// container memory per map/reduce task, in MB
conf.set("mapreduce.map.memory.mb", "4096");
conf.set("mapreduce.reduce.memory.mb", "4096");

// vcores per map/reduce task
conf.set("mapreduce.map.cpu.vcores", "1");
conf.set("mapreduce.reduce.cpu.vcores", "1");

Reference - https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

The default cpu.vcores allocation for a mapper/reducer is 1. You can increase this value for a CPU-intensive application, but remember that if you increase it, the number of mapper/reducer tasks running in parallel is also reduced.
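For illustration, here is a minimal client-side sketch of what this could look like once such a queue exists. The queue name "drf_pool" is a placeholder, the 8 GB figure mirrors the mappers from the question, and the 2-vcore request is only there to show the parallelism effect described above:

Configuration conf = new Configuration();

// route the job into the Fair Scheduler queue/pool created earlier ("drf_pool" is a placeholder name)
conf.set("mapreduce.job.queuename", "drf_pool");

// per-task container memory (8 GB mappers, as in the question)
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.reduce.memory.mb", "8192");

// asking for 2 vcores per mapper roughly halves how many mappers
// a node runs in parallel compared to the default of 1
conf.set("mapreduce.map.cpu.vcores", "2");
conf.set("mapreduce.reduce.cpu.vcores", "1");

Job job = Job.getInstance(conf);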

0 votes

You have to set the configuration on the job like this. Try this:

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = Job.getInstance(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which contains your
// map/reduce classes, so the cluster can load them
job.setJarByClass(Mapper.class);
// key/value types of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this sets the format of your input, can also be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same for the output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes a possibly existing output path to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the (now empty) output path
TextOutputFormat.setOutputPath(job, out);

// this waits until the job completes and prints debug output to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);

For YARN, the following configurations need to be set.

// this should match what is defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");

// to set the limit to 26 GB (26624 MB)
conf.set("yarn.nodemanager.resource.memory-mb", "26624");

// the framework is now "yarn"; it should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");

// like defined in core-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");