I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM.
My input files range from a few MB to ~68 MB (gzipped log files, which get uploaded to my server once they reach >60 MB, hence there is no fixed maximum size). I want to run some Hive jobs on about 500-600 of those files.
Due to the inconsistent input file sizes, I haven't changed the block size in Hadoop so far. As I understand it, the best-case scenario would be block size = input file size, but will Hadoop pad that block until it is full if the file is smaller than the block size? And how do the size and number of input files affect performance, compared to, say, one big ~40 GB file?
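For reference, this is roughly how I would check whether the small files actually occupy a full block; /data/logs is just a placeholder for my upload directory:

# per-file size and space consumed across replicas
hdfs dfs -du -h /data/logs

# list each file with the number of blocks it is split into
hdfs fsck /data/logs -files -blocks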
And what would the optimal configuration for this setup look like?
Based on this guide (http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/), I came up with the following configuration:
32 GB of RAM, with 2 GB reserved for the OS, gives me 30720 MB that can be allocated to YARN containers.
yarn.nodemanager.resource.memory-mb=30720
With 8 cores, I thought a maximum of 10 containers should be safe, so each container gets 30720 / 10 = 3072 MB of RAM.
yarn.scheduler.minimum-allocation-mb=3072
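In yarn-site.xml these two settings would look roughly like this (only the properties discussed here, everything else left at its defaults):

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>30720</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3072</value>
</property>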
For Map task containers I doubled the minimum container size, which would allow for a maximum of 5 Map tasks:
mapreduce.map.memory.mb=6144
And since I want a maximum of 3 Reduce tasks, I allocate:
mapreduce.reduce.memory.mb=10240
With JVM heap sizes that fit into the containers:
mapreduce.map.java.opts=-Xmx5120m
mapreduce.reduce.java.opts=-Xmx9216m
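Put together, the MapReduce side of this would go into mapred-site.xml roughly as follows (same values as above, nothing else changed):

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>6144</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>10240</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx5120m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx9216m</value>
</property>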
Do you think this configuration would be good, or would you change anything, and why?