0
votes

I am running a query using Hive on Spark which is exhibiting some strange behavior. I've run it multiple times and observed the same behavior. The query:

  • reads from a large Hive external table
  • Spark creates about ~990,000 tasks
  • runs in a YARN queue with > 2900 CPUs available
  • uses 700 executors with 4 CPUs per executor

All is well at the start of the job. After ~1.5 hours of 2800 CPUs cranking, the job is ~80% complete (800k/990k tasks). From there, things start to nosedive: Spark stops using all of the CPUs available to it to work on tasks. With ~190k tasks to go, Spark will gradually drop from using 2800 CPUs to double digits (usually bottoming out around 20 total CPUs). This makes the last 190k tasks take significantly longer to finish than the previous 800k.

I could see as the job got very close to completing that Spark would be unable to parallelize a small amount of remaining tasks across a large number of CPUs. But with 190k tasks left to be started, it seems way too early for that.

Things I've checked:

  • No other job is pre-empting its resources in YARN. (In addition, if this were the case, I would expect the job to randomly lose/regain resources, instead of predictably losing steam at the 80% mark).
  • This occurs whether dynamic allocation is enabled or disabled. If disabled, Spark has all 2800 CPUs available for the entire run time of the job - it just doesn't use them. If enabled, Spark does spin down executors as it decides it no longer needs them.
  • If data skew were the issue, I could see some tasks taking longer than others to finish. But it doesn't explain why Spark wouldn't be using idle CPUs to start on the backlog of tasks still to go.

Does anyone have any advice?

1

1 Answers

1
votes

For posterity, this answer from Travis Hegner contained the answer.

Setting spark.locality.wait=0s fixes this issue. I'm also not sure why a 3 second wait causes such a pile up in Spark's ability to schedule tasks, but setting to 0 makes the job run extremely well.