
I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB RAM.

My input files range from a few MB to ~68 MB (gzipped log files, which get uploaded to my server once they exceed 60 MB, hence no fixed maximum size). I want to run some Hive jobs on about 500-600 of those files.

Because of the varying input file sizes, I haven't changed the block size in Hadoop so far. As I understand it, the best-case scenario would be block size = input file size, but does Hadoop fill that block until it is full if the file is smaller than the block size? And how do the size and number of input files affect performance, compared to, say, one big ~40 GB file?

And what would the optimal configuration for this setup look like?

Based on this guide (http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/) I came up with this configuration:

32 GB RAM, with 2 GB reserved for the OS, gives me 30720 MB that can be allocated to YARN containers.

yarn.nodemanager.resource.memory-mb=30720

With 8 cores I thought a maximum of 10 containers should be safe, so each container gets 30720 / 10 = 3072 MB of RAM.

yarn.scheduler.minimum-allocation-mb=3072

For map task containers I doubled the minimum container size, which allows for a maximum of 5 map tasks:

mapreduce.map.memory.mb=6144

And if I want a maximum of 3 reduce tasks, I allocate:

mapreduce.map.memory.mb=10240

With the JVM heap sizes set to fit into the containers:

mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m
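
To make sure I am putting these in the right place: my understanding is that the yarn.* settings belong in yarn-site.xml and the mapreduce.* settings in mapred-site.xml, so my plan (just a sketch, assuming I have the reducer property name right as mapreduce.reduce.memory.mb) is:

yarn-site.xml:

yarn.nodemanager.resource.memory-mb=30720
yarn.scheduler.minimum-allocation-mb=3072

mapred-site.xml:

mapreduce.map.memory.mb=6144
mapreduce.reduce.memory.mb=10240
mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m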

Do you think this configuration would be good, or would you change anything, and why?

The link to the Hortonworks guide is dead. Can we have an equivalent?

1 Answer

Yes, this configuration is good, but there are a few changes I would like to mention.

For reducer memory, it should be mapreduce.reduce.memory.mb=10240 (I think that's just a typo in your question).

One major addition I would suggest is the CPU configuration.

You should set

Container Virtual CPU Cores=15

For the reducers, since you are running only 3 of them, you can give

Reduce Task Virtual CPU Cores=5

And for the mappers

Map Task Virtual CPU Cores=3
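
If you are setting these directly in the config files rather than through a management UI, these display names should correspond (as far as I know) to the following properties, with the values suggested above:

yarn.nodemanager.resource.cpu-vcores=15 (in yarn-site.xml)
mapreduce.map.cpu.vcores=3 (in mapred-site.xml)
mapreduce.reduce.cpu.vcores=5 (in mapred-site.xml)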

The number of mapper (or reducer) containers that run in parallel = min(total RAM / mapreduce.map.memory.mb, total vcores / Map Task Virtual CPU Cores), and correspondingly for reducers with mapreduce.reduce.memory.mb and Reduce Task Virtual CPU Cores.
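
For example, plugging in your numbers (30720 MB of allocatable container memory and 15 vcores):

parallel mappers  = min(30720 / 6144, 15 / 3) = min(5, 5) = 5
parallel reducers = min(30720 / 10240, 15 / 5) = min(3, 3) = 3

so memory and vcores are balanced and neither resource sits idle.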

Please refer to http://openharsh.blogspot.in/2015/05/yarn-configuration.html for a detailed understanding.