If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).
If the block size is smaller than the overall file size then you run into the possibility that the blocks are not all on the same data node and you lose data locality. This isn't a problem with splittable files, as a map task will be created for each block.
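If you want to see how you'd actually do that, the block size can be set per file when the file is written. Below is a minimal sketch using the Hadoop FileSystem API; the class name, destination path, replication factor, and buffer size are just placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithLargeBlock {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Illustrative destination for a non-splittable file (e.g. a gzip archive)
            Path dst = new Path("/data/archive.gz");

            // Pick a block size at least as large as the file so it stays in one block
            long blockSize = 2L * 1024 * 1024 * 1024; // 2 GB

            // FileSystem.create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out = fs.create(dst, true, 4096, (short) 3, blockSize)) {
                // ... write the file contents through 'out' here ...
            }
        }
    }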
As for an upper limit for block size, I know that for certain older versions of Hadoop the limit was 2 GB (above which the block contents were unobtainable) - see https://issues.apache.org/jira/browse/HDFS-96
There is no downside to storing smaller files with larger block sizes - to emphasize this point consider a 1 MB and a 2 GB file, each with a block size of 2 GB:
- 1 MB - 1 block, single entry in the Name Node, 1 MB physically stored on each data node replica
- 2 GB - 1 block, single entry in the Name Node, 2 GB physically stored on each data node replica
So other than the required physical storage, there is no downside to the Name Node block table (both files have a single entry in the block table).
The only possible downside is the time it takes to replicate a smaller versus a larger block, but on the flip side if a data node is lost from the cluster, then tasking 2000 x 1 MB blocks to replicate is slower than replicating a single 2 GB block.
Update - a worked example
Seeing as this is causing some confusion, here are some worked examples:
Say we have a system with a 300 MB HDFS block size, and to make things simpler we have a pseudo cluster with only one data node.
If you want to store a 1100 MB file, then HDFS will break that file up into at most 300 MB blocks and store them on the data node in special block indexed files. If you were to go to the data node and look at where it stores the indexed block files on physical disk you may see something like this:
/local/path/to/datanode/storage/0/blk_000000000000001 300 MB
/local/path/to/datanode/storage/0/blk_000000000000002 300 MB
/local/path/to/datanode/storage/0/blk_000000000000003 300 MB
/local/path/to/datanode/storage/0/blk_000000000000004 200 MB
Note that the file size isn't exactly divisible by 300 MB, so the final block of the file is sized as the modulo of the file size by the block size.
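To make that arithmetic explicit, here's a tiny sketch (plain Java, nothing Hadoop specific; the class name is just for illustration) that works out the block layout of the 1100 MB file:

    public class BlockLayout {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileSize  = 1100 * mb;  // the 1100 MB file from the example
            long blockSize = 300 * mb;   // the 300 MB HDFS block size

            long fullBlocks  = fileSize / blockSize;                  // 3 full 300 MB blocks
            long lastBlock   = fileSize % blockSize;                  // 200 MB final block
            long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);  // 4 blocks in total

            System.out.printf("%d blocks: %d x 300 MB + 1 x %d MB%n",
                    totalBlocks, fullBlocks, lastBlock / mb);
        }
    }

Running it prints 4 blocks: 3 x 300 MB + 1 x 200 MB, which matches the block files listed above.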
Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:
/local/path/to/datanode/storage/0/blk_000000000000005 1 MB
Again note that the actual file stored on the data node is 1 MB, NOT a 300 MB file with 299 MB of zero padding (which I think is where the confusion is coming from).
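If you want to verify this against a running cluster, you can ask for a file's block locations and print the actual length of each block. A rough sketch (the class name is illustrative; the file path is taken from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. the 1 MB file

            // Each BlockLocation reports the actual bytes held in that block,
            // not the configured block size
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
            }
        }
    }

For the 1 MB file you'd expect a single block with a length of 1048576 bytes, regardless of the 300 MB block size.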
Now where the block size does play a factor in efficiency is in the Name Node. For the above two examples, the Name Node needs to maintain a map of file names to block names and data node locations (as well as the total file size and block size):
filename index datanode
-------------------------------------------
fileA.txt blk_01 datanode1
fileA.txt blk_02 datanode1
fileA.txt blk_03 datanode1
fileA.txt blk_04 datanode1
-------------------------------------------
fileB.txt blk_05 datanode1
You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the Name Node). Also, pulling back all the blocks would be more expensive as you'd be making 1100 RPC calls to datanode1 rather than 4.
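To put numbers on that, a quick sketch (again just illustrative) counting the block entries needed for the 1100 MB file at the two block sizes:

    public class NameNodeEntries {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileSize = 1100 * mb;  // fileA.txt from the example

            for (long blockSizeMb : new long[] {300, 1}) {
                long blockSize = blockSizeMb * mb;
                // One entry in the Name Node map (and one block to fetch) per block
                long entries = (fileSize + blockSize - 1) / blockSize; // ceiling division
                System.out.printf("block size %4d MB -> %4d block entries%n",
                        blockSizeMb, entries);
            }
        }
    }

This prints 4 entries for the 300 MB block size and 1100 for the 1 MB block size.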