I have a 4 GB file that I am trying to share across all mappers through the DistributedCache. However, I am observing a significant delay before the map task attempts start: specifically, between the time I submit my job (through job.waitForCompletion()) and the time the first map task starts.
I would like to know what the side effects are of having large files in the DistributedCache. How many times is a file in the DistributedCache replicated? Does the number of nodes in the cluster have any effect on this?
(My cluster has about 13 nodes, each a very powerful machine able to host close to 10 map slots.)
Thanks