1
votes

For some reason sometimes the cluster seems to misbehave for I suddenly see surge in number of YARN jobs.We are using HDInsight Linux based Hadoop cluster. We run Azure Data Factory jobs to basically execute some hive script pointing to this cluster. Generally average number of YARN apps at any given time are like 50 running and 40-50 pending. None uses this cluster for ad-hoc query execution. But once in few days we notice something weird. Suddenly number of Yarn apps start increasing, both running as well as pending, but especially pending apps. So this number goes more than 100 for running Yarn apps and as for pending it is more than 400 or sometimes even 500+. We have a script that kills all Yarn apps one by one but it takes long time, and that too is not really a solution. From our experience we found that the only solution, when it happens, is to delete and recreate the cluster. It may be possible that for some time cluster's response time is delayed (Hive component especially) but in that case even if ADF keeps retrying several times if a slice is failing, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run when it can? That's probably the only explanation why it could be happening. Has anyone faced this issue?

1

1 Answers

1
votes

Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.

Azure Data factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job which then submits a child job which is the actual Hive script you are trying to run. The yarn queue can get deadlocked if there are too many parents jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, thus no work is actually being done. Note that if you kill the Templeton job this will result in Data Factory marking the time slice as completed even though obviously it was not.

If you are already in a deadlock, you can try adjusting the Maximum AM Resource from the default 33% to something higher and/or scaling up your cluster. The goal is to be able to allow some of the pending child jobs to run and slowly draining the queue.

As a correct long term fix, you need to configure WebHCat so that parent templeton job is submitted to a separate Yarn queue. You can do this by (1) creating a separate yarn queue and (2) set templeton.hadoop.queue.name to the newly created queue.

To create queue you can do this via the Ambari > Yarn Queue Manager.

To update WebHCat config via Ambari go to Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.

More info on WebHCat config: https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure