I'm playing with Flink on YARN for testing purposes. I have the following setup:
3 machines on AWS (32 cores and 64 GB of memory)
I installed Hadoop 2 with the HDFS and YARN services manually (without using EMR).
Machine #1 runs HDFS (NameNode & SecondaryNameNode) and YARN (ResourceManager), defined in the masters file
Machine #2 runs HDFS (DataNode) and YARN (NodeManager), defined in the slaves file
Machine #3 runs HDFS (DataNode) and YARN (NodeManager), defined in the slaves file
I want to submit an Apache Flink job that reads about 20 GB of logs from HDFS, processes them, and then stores the result in Cassandra.
The problem is that I think I'm doing something wrong, because the job takes quite a long time (about an hour), and I think it's not very optimized.
I'm running Flink with the following command:
./flink-1.3.0/bin/flink run -yn 2 -ys 30 -yjm 7000 -ytm 8000 -m yarn-cluster /home/ubuntu/reports_script-1.0-SNAPSHOT.jar
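For comparison, a sketch of a rebalanced submission. This assumes one TaskManager container per worker node; the slot count and memory figures (16 slots, 40 GB per TaskManager) are illustrative choices for a 32-core / 64 GB machine, not tuned values:

```shell
# Illustrative rebalancing, not tuned values:
#   -yn 2      one TaskManager container per worker node
#   -ys 16     fewer slots per TaskManager (e.g. half of 32 cores)
#   -yjm 2048  JobManager memory in MB
#   -ytm 40960 TaskManager memory in MB (most of each 64 GB node)
./flink-1.3.0/bin/flink run -m yarn-cluster \
  -yn 2 -ys 16 -yjm 2048 -ytm 40960 \
  /home/ubuntu/reports_script-1.0-SNAPSHOT.jar
```

The idea is to give each slot substantially more memory than the original command does, rather than spreading a small container across many slots.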
In the Flink logs I see that there are 60 task slots in use, but when I look at the YARN page I see very low usage of vcores and memory.
What am I doing wrong?
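A back-of-the-envelope check of the command above suggests one reason throughput may suffer: with -ytm 8000 and -ys 30, all 30 slots share a single 8000 MB TaskManager, leaving each slot roughly 266 MB:

```shell
# Rough memory-per-slot arithmetic for the submission above (-ytm 8000, -ys 30).
TM_MEMORY_MB=8000
SLOTS_PER_TM=30
echo "$(( TM_MEMORY_MB / SLOTS_PER_TM )) MB per slot"   # prints: 266 MB per slot
```

That is very little working memory per parallel task for a 20 GB input.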
Comment from Till Rohrmann: yarn.containers.vcores. More YARN-specific configuration options can be found here.
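The option mentioned in the comment can be set in flink-conf.yaml; the value 16 here is an illustrative assumption (e.g. matching the number of slots per TaskManager), not a recommendation:

```yaml
# flink-conf.yaml (illustrative): number of virtual cores Flink requests
# per YARN container, e.g. matched to the slots per TaskManager (-ys).
yarn.containers.vcores: 16
```

Note also that the YARN web UI can show one vcore per container regardless of this setting if the scheduler is using the memory-only DefaultResourceCalculator, so low vcore numbers on that page are not by themselves proof that Flink is underusing the cluster.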