
HDFS supports storing files in compressed formats. I know that gzip compression doesn't support splitting. Now imagine the file is a gzip-compressed file whose compressed size is 1 GB. My question is:

  1. How will this file get stored in HDFS (block size is 64 MB)?

From this link I came to know that the gzip format uses DEFLATE to store the compressed data, and that DEFLATE stores data as a series of compressed blocks.

But I couldn't understand it completely, and I am looking for a broader explanation.

More doubts about this gzip-compressed file:

  1. How many blocks will there be for this 1 GB gzip-compressed file?
  2. Will it go on multiple datanodes?
  3. How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?
  4. What is the DEFLATE algorithm?
  5. Which algorithm is applied while reading the gzip-compressed file?

I am looking for a broad and detailed explanation here.

A file in a file system does not have to be contiguous on disk, whether the disk is one physical disk or many disks in a distributed file system. The file system divides the file into blocks, which it stores wherever it decides to. When an application requests a file, the file system knows the mapping to the blocks and where the blocks are; it sends I/O requests to retrieve them and pieces the blocks back into the file. This division of large things is kind of the whole point: a distributed system can pool resources to do things a single system couldn't do alone. – e0k

1 Answer


How will this file get stored in HDFS (block size is 64 MB) if splitting is not supported for the gzip format?

The file is stored like any other file in HDFS: it is divided into DFS blocks, and those blocks are distributed across datanodes by the namenode's block placement policy, regardless of whether the format is splittable. With a 64 MB block size, the 1 GB file occupies 16 DFS blocks (1 GB / 64 MB = 15.625, rounded up). Non-splittability matters for processing, not for storage: a gzip stream cannot be decompressed starting from an arbitrary block boundary, so a single task (e.g., one mapper) has to read the whole file from the beginning.
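A quick way to see this on a real cluster is to ask the namenode for the file's block locations through the Hadoop FileSystem API. A minimal sketch (the path /data/big-file.gz is hypothetical; point it at any large file on your cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical path; replace with a file on your cluster.
        FileStatus status = fs.getFileStatus(new Path("/data/big-file.gz"));

        // One BlockLocation per DFS block; getHosts() lists the
        // datanodes holding that block's replicas.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d offset=%d hosts=%s%n",
                i, blocks[i].getOffset(),
                String.join(",", blocks[i].getHosts()));
        }
    }
}

Running this against a 1 GB file will typically print 16 blocks spread over several datanodes, which also answers the "Will it go on multiple datanodes?" question above.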

How many blocks will there be for this 1 GB gzip-compressed file?

1 GB / 64 MB = 15.625, rounded up to 16 DFS blocks (15 full 64 MB blocks plus one partially filled block).
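The rounding-up is a plain ceiling division; a minimal sketch of the arithmetic, using the decimal convention (1 GB = 1000 MB) as above:

public class BlockCount {
    public static void main(String[] args) {
        long fileSize  = 1000L * 1024 * 1024; // 1 GB (decimal convention)
        long blockSize = 64L * 1024 * 1024;   // 64 MB HDFS block size

        // HDFS counts the last, partially filled block as a whole block.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println(blocks); // prints 16
    }
}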

How will the replication factor apply to this file (the Hadoop cluster replication factor is 3)?

Same as for any other file: replication in HDFS is applied per block, not per file, so splittability makes no difference here. Each of the 16 blocks is replicated 3 times, giving 48 block replicas in total, and each replica is placed on a datanode chosen by the block placement policy described below.
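If you would rather read the numbers off a live cluster than compute them by hand, FileStatus exposes both the block size and the replication factor. A minimal sketch (again with a hypothetical path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CountReplicas {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical path; replace with a file on your cluster.
        FileStatus status = fs.getFileStatus(new Path("/data/big-file.gz"));

        long blockSize = status.getBlockSize();      // e.g. 64 MB
        long blocks = (status.getLen() + blockSize - 1) / blockSize;
        short replication = status.getReplication(); // e.g. 3

        // 16 blocks x 3 replicas = 48 block replicas cluster-wide.
        System.out.println(blocks * replication + " block replicas");
    }
}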

From the source code at this link: http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java/?v=source

and

http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.22.0/org/apache/hadoop/hdfs/server/namenode/BlockPlacementPolicyDefault.java/?v=source

/** The class is responsible for choosing the desired number of targets
 * for placing block replicas.
 * The replica placement strategy is that if the writer is on a datanode,
 * the 1st replica is placed on the local machine, 
 * otherwise a random datanode. The 2nd replica is placed on a datanode
 * that is on a different rack. The 3rd replica is placed on a datanode
 * which is on the same rack as the first replica.
 */
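To make the quoted strategy concrete, here is a toy sketch of it. The Node class and ToyPlacementPolicy are invented for illustration; the real HDFS implementation works on DatanodeDescriptor objects and a full NetworkTopology:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Toy model only; assumes the cluster always has a suitable candidate.
class Node {
    final String host;
    final String rack;
    Node(String host, String rack) { this.host = host; this.rack = rack; }
}

class ToyPlacementPolicy {
    private final Random rnd = new Random();

    List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        // 1st replica: the writer's own datanode if it is one,
        // otherwise a random datanode.
        Node first = cluster.contains(writer)
            ? writer : cluster.get(rnd.nextInt(cluster.size()));
        targets.add(first);
        // 2nd replica: a datanode on a different rack.
        targets.add(pick(cluster, n -> !n.rack.equals(first.rack), targets));
        // 3rd replica: another datanode on the same rack as the first.
        targets.add(pick(cluster, n -> n.rack.equals(first.rack), targets));
        return targets;
    }

    private Node pick(List<Node> cluster, Predicate<Node> ok, List<Node> used) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster)
            if (ok.test(n) && !used.contains(n)) candidates.add(n);
        return candidates.get(rnd.nextInt(candidates.size()));
    }
}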

What is the DEFLATE algorithm?

DEFLATE is the compression algorithm used by the gzip format (and by zip): it combines LZ77 dictionary compression with Huffman coding. Reading a gzip-compressed file runs the inverse algorithm, commonly called inflate, over the compressed blocks, which answers the last question above.
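In Java you rarely invoke inflate yourself; the standard library's GZIPInputStream applies it transparently as you read. A minimal sketch (the file name example.gz is hypothetical):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class ReadGzip {
    public static void main(String[] args) throws Exception {
        // GZIPInputStream runs inflate (the inverse of DEFLATE)
        // over the compressed blocks as bytes are consumed.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                    new GZIPInputStream(
                        new FileInputStream("example.gz"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}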

Have a look at this slide to get an overview of the algorithms used by the different variants of zip files.


Have a look at this presentation for more details.