When Tez chooses the number of mappers to run, it looks at the number of containers that can run in parallel (available slots), a wave factor, rack locality of the data, the FileInputFormat max split size, the Tez max grouping size, the stripes that can go into each split, the total uncompressed size of the columns to be fetched, and so on. It does not look at the Tez container size.
So the calculation of the number of mappers yields an input split length (in bytes) per mapper, which can be estimated before running the job.
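To make that estimate concrete, here is a rough sketch of the grouping arithmetic as I understand it. It is a simplification, not the actual Tez code: it ignores rack locality and stripe boundaries, the function name `estimate_grouped_splits` is hypothetical, and the default values are my assumptions about `tez.grouping.split-waves`, `tez.grouping.min-size` and `tez.grouping.max-size`, which should be checked against the cluster configuration.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_grouped_splits(total_input_bytes, available_slots, waves=1.7,
                            min_size=50 * MIB, max_size=1 * GIB):
    """Rough estimate of (number of mappers, bytes per mapper).

    Assumed defaults: waves ~ tez.grouping.split-waves,
    min_size ~ tez.grouping.min-size, max_size ~ tez.grouping.max-size.
    Locality and ORC stripe alignment are deliberately ignored.
    """
    # Start from the number of tasks that fills the slots for `waves` waves.
    desired = max(1, int(available_slots * waves))
    bytes_per_group = total_input_bytes / desired

    # Clamp the group size into [min_size, max_size] by changing the count.
    if bytes_per_group > max_size:
        desired = total_input_bytes // max_size + 1
    elif bytes_per_group < min_size:
        desired = total_input_bytes // min_size + 1

    return desired, math.ceil(total_input_bytes / desired)


if __name__ == "__main__":
    mappers, split_bytes = estimate_grouped_splits(10 * GIB, available_slots=20)
    print(mappers, split_bytes)
```

The second returned value is the per-mapper input split length that the rest of the question refers to.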
But how do I estimate the total container size (memory) needed to process that input split?
I understand the memory needed will depend on:
- Input split length raw (bytes)
- Compression (percentage?)
- Any UDF which will be applied on the records (negligible probably)
- Vectorization if being used (boolean)
- Map join if needed (boolean)
- Sorting if needed (boolean)
- Buffer used before writing to disk (percentage?)
But how can I estimate the container size, or rather the heap space needed within the container, based on the input split bytes?
One way is to look at the committed heap bytes of a mapper task after one run.
But is there any formula to estimate COMMITTED_HEAP_BYTES from INPUT_SPLIT_LENGTH_BYTES based on the above factors, or any others?
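To illustrate the kind of answer I am after, here is a sketch of the empirical fallback: collect (INPUT_SPLIT_LENGTH_BYTES, COMMITTED_HEAP_BYTES) pairs from mapper tasks of earlier runs and fit a straight line through them. The linear model is purely my assumption, not a known Tez formula. The function names are hypothetical, `heap_fraction=0.8` is what I assume `tez.container.max.java.heap.fraction` is set to, and `headroom` is an arbitrary safety margin.

```python
import math

MIB = 1024 ** 2


def fit_heap_model(samples):
    """Least-squares fit of heap = intercept + slope * split_bytes.

    samples: list of (input_split_length_bytes, committed_heap_bytes)
    pairs taken from task counters of previous runs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples have the same split length")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def suggest_container_mb(split_bytes, intercept, slope,
                         heap_fraction=0.8, headroom=1.2):
    """Translate a predicted heap size into a container size in MiB.

    heap_fraction: assumed share of the container given to the JVM heap.
    headroom: arbitrary multiplier on the predicted heap (an assumption).
    """
    heap_bytes = (intercept + slope * split_bytes) * headroom
    return math.ceil(heap_bytes / heap_fraction / MIB)
```

Such a fit would only hold for one query shape at a time: a map join hash table or a sort buffer adds a cost that does not scale with the split length, so it would end up in the intercept rather than the slope.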