
I have a job that runs a HiveQL query joining two tables (2.5 TB and 45 GB), repartitions the result into 100 partitions, and then applies some other transformations. It ran fine earlier.

Job stages:

- Stage 0: Hive table 1 scan
- Stage 1: Hive table 2 scan
- Stage 2: Tungsten exchange for the join
- Stage 3: Tungsten exchange for the repartition

Today the job is stuck in Stage 2. Of the 200 tasks that are supposed to run, none have started, yet 290 task attempts have failed because their executors were preempted.

Drilling down into the stage shows "no metrics reported by executors", yet under the Executors tab I can see 40 executors with active tasks. Also, when Stage 2 starts, the shuffle read grows gradually, stops at 45 GB, and after that I see no progress.

Any input on how to resolve this issue? I'll try reducing the executor memory to see whether resource allocation is the problem.

Thanks.


1 Answer


It turned out to be a huge dataset, and the join was being re-evaluated during this stage: the long-running tasks were spending their time re-reading the source datasets. Persisting the joined dataset made the job progress much faster.