When Tez chooses the number of mappers to run, it looks at the number of containers that can run in parallel (available slots), a wave factor, rack locality of the data, the FileInputFormat max split size, the Tez max grouping size, the stripes that can go into a split, the total uncompressed size of the columns to be fetched, and so on. It does not look at the Tez container size.

So the mapper-count calculation yields an input split length (in bytes) per mapper, which can be estimated before running the job.
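To make that estimate concrete, here is a rough sketch of how the split length per mapper could be approximated up front. It is a simplification, not Tez's actual grouping code: it ignores locality and stripe boundaries, and the function name, parameter names, and default values below are my own assumptions (the defaults are meant to mirror what I believe `tez.grouping.split-waves`, `tez.grouping.min-size` and `tez.grouping.max-size` usually are, so check your cluster's settings).

```python
def estimate_split_bytes(total_bytes, available_slots, wave_factor=1.7,
                         min_group_bytes=50 * 1024**2,
                         max_group_bytes=1024**3):
    """Hypothetical helper: approximate (num_mappers, bytes_per_mapper).

    Assumed defaults; verify them against your own Tez configuration.
    """
    # Desired number of tasks = parallel slots scaled by the wave factor.
    desired_tasks = max(1, int(available_slots * wave_factor))

    # Ceiling division: bytes each task would get if data were spread evenly.
    per_task = -(-total_bytes // desired_tasks)

    # Clamp the group size to the configured min/max grouping sizes.
    per_task = min(max(per_task, min_group_bytes), max_group_bytes)

    # Recompute the task count from the clamped group size.
    num_tasks = max(1, -(-total_bytes // per_task))
    return num_tasks, per_task


if __name__ == "__main__":
    tasks, split_len = estimate_split_bytes(100 * 1024**3, available_slots=100,
                                            wave_factor=1.5)
    print(tasks, split_len)
```

With small inputs the minimum grouping size dominates and the estimate collapses to a single mapper; with very large inputs the maximum grouping size caps the split length and the mapper count grows instead.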

But how do I estimate the total container size (memory) needed to process that input split?

I understand the memory needed will depend on:

  1. Input split length raw (bytes)
  2. Compression (percentage?)
  3. Any UDF which will be applied on the records (negligible probably)
  4. Vectorization if being used (boolean)
  5. Map join if needed (boolean)
  6. Sorting if needed (boolean)
  7. Buffer used before writing into disk (percentage?)

But how can I estimate the container size, or rather the heap space needed within the container, based on the input split bytes?

One way is to look at the committed heap bytes of a mapper task after one run.

But is there a formula to estimate COMMITTED_HEAP_BYTES from INPUT_SPLIT_LENGTH_BYTES based on the factors above, or on any other factors?

1 Answer

I don't think the input split length per mapper affects the Tez container size directly. It only means the split will be processed by one mapper; it doesn't mean the whole split is loaded into memory at once. So the split length can be much larger than the Tez container that runs the mapper.

As a general guideline,

Set the Tez container size to the YARN minimum container size (yarn.scheduler.minimum-allocation-mb) or a small multiple of it (1 or 2 times), but NEVER more than yarn.scheduler.maximum-allocation-mb. You want to leave headroom for multiple containers to be spun up.
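As a minimal sketch of that guideline, the arithmetic looks something like this. The function and parameter names are made up for illustration, and the 80% heap fraction is only a commonly quoted rule of thumb that I am assuming here, not something derived from the split size.

```python
def pick_tez_container_mb(yarn_min_mb, yarn_max_mb, multiple=1,
                          heap_fraction=0.8):
    """Hypothetical helper: return (container_mb, heap_mb).

    heap_fraction=0.8 is an assumed rule of thumb; tune it for your workload.
    """
    if multiple < 1:
        raise ValueError("multiple must be at least 1")

    # A small multiple of the YARN minimum allocation, never above the maximum.
    container_mb = min(yarn_min_mb * multiple, yarn_max_mb)

    # Leave part of the container for non-heap memory (assumed fraction).
    heap_mb = int(container_mb * heap_fraction)
    return container_mb, heap_mb


if __name__ == "__main__":
    container_mb, heap_mb = pick_tez_container_mb(1024, 8192, multiple=2)
    print(container_mb, heap_mb)
```

The first value would go into hive.tez.container.size and the second into the -Xmx setting in hive.tez.java.opts, assuming you are running Hive on Tez.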

See more details in this doc.