
I am working on a four-node Hadoop cluster. I have run a series of experiments with different block sizes and measured the run times as follows.

All of them were performed on a 20 GB input file: 64 MB - 32 min, 128 MB - 19 min, 256 MB - 15 min, 1 GB - 12.5 min.

Should I proceed further and try a 2 GB block size? Also, kindly explain what an optimal block size would be if similar operations are performed on a 90 GB file. Thanks!
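For reference, the block size can be set per file when writing to HDFS, so each run can use its own value without changing the cluster-wide default. Below is a minimal Java sketch of such an upload; the paths, replication factor, and buffer size are just placeholders for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithBlockSize {
    public static void main(String[] args) throws IOException {
        // Placeholder paths: local source file and HDFS destination.
        String localFile = "/data/input-20gb.txt";
        String hdfsDest = "/user/test/input-20gb.txt";
        long blockSize = 2L * 1024 * 1024 * 1024; // 2 GB block size for this test run

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize)
        // lets this one file override the cluster's default dfs.blocksize.
        try (InputStream in = new BufferedInputStream(new FileInputStream(localFile));
             FSDataOutputStream out = fs.create(new Path(hdfsDest), true, 4096, (short) 3, blockSize)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
            }
        }
    }
}
```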

@Ashrith: I need different answers here. Kindly go through the question again before marking it as a duplicate. – re3el
This question is very similar to your previous question (stackoverflow.com/questions/28134288/block-size-in-hadoop); you could have modified the original question instead of creating a new one. – Ashrith
Yeah, but I hardly get answers to my questions after edits. That is something that happens on Stack Overflow: a question is active only when it is first asked. Because of this experience, I have posted another one. – re3el

1 Answer


You should test with a 2 GB block size and compare the results.

Just consider the following trade-off: a larger block size reduces the overhead of creating map tasks, but for non-local tasks Hadoop has to transfer the whole block to the remote node (network bandwidth is the limit here), so a smaller block size performs better in that case.

In your case, with 4 nodes (I assume they are connected by a local switch or router on a LAN), 2 GB isn't a problem. But the same answer does not hold in other environments with higher error rates.
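To make the overhead argument concrete for the 90 GB case, here is a small Java sketch that estimates how many map tasks the default splitting would create at each block size. It assumes roughly one split per block, which is the usual FileInputFormat behaviour; the exact count on a real cluster can differ slightly.

```java
public class SplitCount {
    public static void main(String[] args) {
        long fileSize = 90L * 1024 * 1024 * 1024;  // 90 GB input file
        long[] blockSizes = {
            64L   * 1024 * 1024,   // 64 MB
            128L  * 1024 * 1024,   // 128 MB
            256L  * 1024 * 1024,   // 256 MB
            1024L * 1024 * 1024,   // 1 GB
            2048L * 1024 * 1024    // 2 GB
        };
        for (long bs : blockSizes) {
            // ceil(fileSize / blockSize): approximate number of map tasks
            long splits = (fileSize + bs - 1) / bs;
            System.out.printf("block size %5d MB -> ~%d map tasks%n",
                              bs / (1024 * 1024), splits);
        }
    }
}
```

For the 90 GB file this works out to roughly 1440 tasks at 64 MB versus about 45 at 2 GB, which shows how much scheduling overhead larger blocks remove, at the cost of each non-local task having to pull a much bigger block across the network.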