We recently migrated from MapReduce to Tez for executing Hive queries on EMR. We are seeing cases where the exact same Hive query launches a very different number of mappers from run to run. See the Map 3 phase below: on the first run it requested 305 mappers, and on another run it requested 4534. (Please ignore the KILLED status; I manually killed the query.) Why does this happen, and how can we change it so the mapper count is based on the underlying data size instead?

Run 1

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1            container        KILLED      5          0        0        5       0       0  
Map 3            container        KILLED    305          0        0      305       0       0  
Map 5            container        KILLED     16          0        0       16       0       0  
Map 6            container        KILLED      1          0        0        1       0       0  
Reducer 2        container        KILLED    333          0        0      333       0       0  
Reducer 4        container        KILLED    796          0        0      796       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 00/06  [>>--------------------------] 0%    ELAPSED TIME: 14.16 s    
----------------------------------------------------------------------------------------------

Run 2

----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0  
Map 3            container        KILLED   4534          0        0     4534       0       0  
Map 5 .......... container     SUCCEEDED    325        325        0        0       0       0  
Map 6 .......... container     SUCCEEDED      1          1        0        0       0       0  
Reducer 2        container        KILLED    333          0        0      333       0       0  
Reducer 4        container        KILLED    796          0        0      796       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 03/06  [=>>-------------------------] 5%    ELAPSED TIME: 527.16 s   
----------------------------------------------------------------------------------------------

1 Answer

This article explains the process by which Tez determines initial task parallelism and allocates resources: https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

If Tez grouping is enabled for the splits, then a generic grouping logic is run on these splits to group them into larger splits. The idea is to strike a balance between how parallel the processing is and how much work is being done in each parallel process.
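
As a rough illustration of that trade-off, here is a minimal Python sketch. The split sizes, the target group count, and the group_splits helper are all made up for the example; the real logic lives in Tez's split grouper.

def group_splits(split_sizes_mb, desired_groups):
    """Greedily pack adjacent splits until roughly desired_groups remain."""
    total_mb = sum(split_sizes_mb)
    target_per_group_mb = total_mb / desired_groups
    groups, current, current_size = [], [], 0
    for size in split_sizes_mb:
        current.append(size)
        current_size += size
        if current_size >= target_per_group_mb:
            groups.append(current)
            current, current_size = [], 0
    if current:
        groups.append(current)
    return groups

# 10,000 splits of 20 MB each (~200 GB of input). One task per split would
# mean 10,000 short-lived containers; grouping into ~400 tasks gives ~500 MB
# of work per task -- more work per container, far less scheduling overhead.
groups = group_splits([20] * 10_000, desired_groups=400)
print(len(groups), sum(groups[0]))   # -> 400 500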

  • First, Tez tries to find out the resource availability in the cluster for these tasks. For that, YARN provides a headroom value (and in the future other attributes may be used). Let's say this value is T.
  • Next, Tez divides T by the resource per task (say M) to find out how many tasks can run in parallel at once (i.e., in a single wave): W = T/M.
  • Next, W is multiplied by a wave factor (from configuration: tez.grouping.split-waves) to determine the number of tasks to be used. Let's say this value is N.
  • If there are a total of X splits (input shards) and N tasks, then this would group X/N splits per task. Tez then estimates the size of data per task based on the number of splits per task.
  • If this value is between tez.grouping.min-size and tez.grouping.max-size, then N is accepted as the number of tasks. If not, then N is adjusted to bring the data per task in line with the max/min, depending on which threshold was crossed (see the sketch after this list).
  • For experimental purposes, tez.grouping.split-count can be set in configuration to specify the desired number of groups. If this config is specified, the above logic is ignored and Tez tries to group splits into the specified number of groups. This is best effort.
  • After this, the grouping algorithm is executed. It groups splits by node locality, then rack locality, while respecting the group size limits.
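
Putting those steps together, here is a minimal sketch of the task-count estimate. It is illustrative only, not the actual Tez source; the parameter names are mine, and the defaults shown for tez.grouping.split-waves / min-size / max-size are the stock Tez values as I understand them, so check your cluster's configuration.

MB = 1024 * 1024

def estimate_tasks(headroom_mb, task_mem_mb, total_input_bytes, num_splits,
                   split_waves=1.7,          # tez.grouping.split-waves
                   min_size=50 * MB,         # tez.grouping.min-size
                   max_size=1024 * MB):      # tez.grouping.max-size
    # W = T / M: how many tasks fit in the cluster at once (one wave).
    w = headroom_mb // task_mem_mb
    # N = W * wave factor: the desired number of tasks.
    n = max(1, int(w * split_waves))
    # Estimated data per task if the splits are grouped into N tasks.
    data_per_task = total_input_bytes / n
    # Adjust N so that each task's share stays within [min-size, max-size].
    if data_per_task > max_size:
        n = int(total_input_bytes / max_size)
    elif data_per_task < min_size:
        n = max(1, int(total_input_bytes / min_size))
    return min(n, num_splits)   # cannot have more groups than input splits

# Same 150 GB of input, different free cluster memory at submit time:
data = 150 * 1024 * MB
print(estimate_tasks(headroom_mb=700_000, task_mem_mb=4096,
                     total_input_bytes=data, num_splits=20_000))   # -> 289
print(estimate_tasks(headroom_mb=40_000, task_mem_mb=4096,
                     total_input_bytes=data, num_splits=20_000))   # -> 150

Because W comes from the headroom YARN reports at submit time, the same query over the same data can be planned with very different mapper counts from one run to the next, which is consistent with what you are seeing on Map 3. If you want the count to track the input size rather than free cluster capacity, narrowing the band between tez.grouping.min-size and tez.grouping.max-size (or, for experiments, fixing tez.grouping.split-count) is the knob the logic above exposes.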