I'm currently using Sequence Files to compress our existing HDFS data.
I have two options for storing the data as Sequence Files:
- A single large file that contains all the records.
- Multiple small files, each sized to exactly match the HDFS block size (128 MB).
As we know, HDFS files are stored as blocks, and each block goes to one mapper. So I think there is no difference between the two options when MapReduce processes the Sequence File(s).
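To make my reasoning concrete, here is a rough sketch of how I count mappers. It assumes a simplified model of one input split per 128 MB block and a hypothetical 10 GB data set; the real split computation in Hadoop has more details, so treat the numbers as illustrative only.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB HDFS block size, as in the question


def num_splits(file_sizes, split_size=BLOCK_SIZE):
    """Simplified model: each file yields ceil(size / split_size) splits,
    and each split is handled by one mapper."""
    # -(-a // b) is integer ceiling division, which avoids float rounding.
    return sum(-(-size // split_size) for size in file_sizes if size > 0)


# Hypothetical data set of 10 GB (an assumption for illustration).
total_bytes = 10 * 1024 ** 3

single_file = [total_bytes]                                  # option one
block_sized_files = [BLOCK_SIZE] * (total_bytes // BLOCK_SIZE)  # option two

print(num_splits(single_file))        # 80
print(num_splits(block_sized_files))  # 80
```

Under this model both layouts give the same number of mappers, which is why I expect no difference on the processing side.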
The only disadvantage I know of for option two is that the NameNode needs more overhead to track those files, whereas option one has only a single file to track.
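Here is a rough sketch of how I estimate that NameNode overhead. It assumes the NameNode keeps one in-memory object per file and one per block, and it uses the commonly quoted rule of thumb of about 150 bytes per object; both the figure and the 10 GB data set are assumptions for illustration, not measurements from our cluster.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB HDFS block size
BYTES_PER_OBJECT = 150          # assumed rule of thumb, not a measured value


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count one object per file plus one object per block (simplified)."""
    files = len(file_sizes)
    blocks = sum(-(-size // block_size) for size in file_sizes)  # ceiling division
    return files + blocks


# Hypothetical data set of 10 GB (an assumption for illustration).
total_bytes = 10 * 1024 ** 3

single_file = [total_bytes]                                  # option one
block_sized_files = [BLOCK_SIZE] * (total_bytes // BLOCK_SIZE)  # option two

for name, layout in [("single file", single_file),
                     ("block-sized files", block_sized_files)]:
    objects = namenode_objects(layout)
    print(name, objects, "objects,", objects * BYTES_PER_OBJECT, "bytes (approx.)")
```

Under these assumptions option one needs 81 objects and option two needs 160, so the block count is the same and the extra cost comes only from the additional file entries.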
I am confused about these two options because I have seen many articles that recommend both of the following:
- Make your HDFS file sizes match the block size as closely as you can.
- Merge small files into a single large file whenever you can.
Can anyone point me to the correct way to do this? Which option is better, and what are the advantages and disadvantages of each? Thanks!