
I am running a Hadoop MapReduce job that reads its input files from HDFS or Amazon S3. I am wondering if it's possible to measure how long a mapper task spends reading its file from HDFS or S3. I'd like to know the time spent just reading the data, not including the mapper's processing time. The result I am looking for is something like MB/second for a given mapper task, which indicates how fast the mapper can read from HDFS or S3; in other words, its I/O performance.

Thanks.


1 Answer


Maybe you can just use an identity mapper and set the number of reducers to zero. Then the only thing your job does is I/O; there is no sorting and no shuffling. If you specifically want to focus on reading, you can replace the identity mapper with a mapper that writes no output at all. I would also set mapred.job.reuse.jvm.num.tasks=-1 to keep the JVM startup overhead out of the measurement. It isn't perfect, but it is probably the easiest way to get a quick idea. If you want to do it precisely, I would consider implementing your own Hadoop counters, but I currently have no experience with that.
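To make that concrete, here is a minimal sketch of a map-only job along these lines: a mapper that discards every record and uses custom counters to track bytes and wall-clock time per task. The class names (ReadThroughputJob, ReadOnlyMapper, ReadStats) are hypothetical, and counting value.getLength() per line is an approximation (it ignores newlines and compression); for exact figures you could instead read the framework's built-in bytes-read counters.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadThroughputJob {

    // Mapper that emits nothing, so the task does little besides read input.
    public static class ReadOnlyMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        // Custom counters: divide BYTES_READ by READ_MILLIS afterwards
        // to get an approximate MB/s per mapper.
        public enum ReadStats { BYTES_READ, READ_MILLIS }

        private long start;

        @Override
        protected void setup(Context context) {
            start = System.currentTimeMillis();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Count the bytes of each line; write no output.
            context.getCounter(ReadStats.BYTES_READ).increment(value.getLength());
        }

        @Override
        protected void cleanup(Context context) {
            context.getCounter(ReadStats.READ_MILLIS)
                   .increment(System.currentTimeMillis() - start);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Reuse one JVM for all tasks to keep startup cost out of the numbers.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

        Job job = Job.getInstance(conf, "read-throughput-test");
        job.setJarByClass(ReadThroughputJob.class);
        job.setMapperClass(ReadOnlyMapper.class);
        job.setNumReduceTasks(0);                         // map-only: no sort/shuffle
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class); // discard all output

        TextInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

After the job finishes, the per-task counter values show up under the custom counter group in the job tracker UI (or via job.getCounters() for job-wide totals), so you can compute MB/s for each mapper individually.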