0
votes

Currently I'm using SequenceFile to compress our existing HDFS data.

Now I have two options for storing this SequenceFile:

  • A single large file, meaning all records go into this one file.
  • Multiple smaller files, each sized to exactly match the HDFS block size (128 MB)

As we know, HDFS files are stored as blocks, and each block goes to one mapper. So I think there's no difference when MapReduce processes the SequenceFile(s).
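For example, here is a quick check I can run to see how many blocks (and therefore roughly how many map tasks, with the default FileInputFormat split size) a given file gets; the path is just a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path to one of the SequenceFiles
            Path path = new Path("/data/archive/part-00000.seq");

            FileStatus status = fs.getFileStatus(path);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            // With the default input split size (= block size), this is
            // roughly the number of map tasks this file will get.
            System.out.println("Block size : " + status.getBlockSize());
            System.out.println("Block count: " + blocks.length);
        }
    }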

The only disadvantage I know of for option two is that the NameNode needs more overhead to maintain those files, whereas option one has only a single file.

I am confused about these two options, since I've seen many articles recommend that you:

  • Make your HDFS file sizes match the block size as closely as you can.
  • Merge small files into a single large file whenever you can.

Can anyone point me to the correct way to do this? Which is better? What are the advantages/disadvantages of these two options? Thanks!


1 Answer

4
votes

Quora.com has a question about why 64 MB was chosen as the default chunk size (this applies to older versions, as 128 MB is now the default block size). Although that question is somewhat different, Ted Dunning's answer there addresses your question too. Ted Dunning wrote:

The reason Hadoop chose 64MB was because Google chose 64MB. The reason Google chose 64MB was due to a Goldilocks argument.

  1. Having a much smaller block size would cause seek overhead to increase.
  2. Having a moderately smaller block size makes map tasks run fast enough that the cost of scheduling them becomes comparable to the cost of running them.
  3. Having a significantly larger block size begins to decrease the available read parallelism and may ultimately make it hard to schedule tasks local to the data.

So I think points 2 and 3 answer your question. Now you have to decide, based on your requirements, whether to store the data as one single big file or in smaller chunks of 128 MB (and yes, you can also increase the block size if you want; the sketch below shows one way to set it per file).
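As a minimal sketch of that per-file approach, assuming hypothetical key/value types and an output path of my own choosing, the SequenceFile writer lets you pass an explicit block size and block compression when creating the file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;

    public class SeqFileWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    // Hypothetical output path and key/value types
                    SequenceFile.Writer.file(new Path("/data/archive/merged.seq")),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // Block-compress the SequenceFile records
                    SequenceFile.Writer.compression(CompressionType.BLOCK),
                    // Per-file HDFS block size: 256 MB instead of the cluster default
                    SequenceFile.Writer.blockSize(256L * 1024 * 1024));
            try {
                writer.append(new LongWritable(1L), new Text("example record"));
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

Whichever direction you pick, a larger per-file block size gives you fewer, bigger splits from one big file, while block-sized smaller files give you the same mapper count with more NameNode metadata to track.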