Compression in Hadoop Sequence File

Question

I have some basic questions about Hadoop sequence files.

1) How much does the default compression codec compress the file?

2) I have a Hadoop sequence file of 100 MB. When I read this file and dump its contents to a text file, the text file is around 1 GB. Is that OK?

3) While reading the sequence file, what is the significance of "syncSeen()" and "seek(long position)"? Is there any problem if I do not use these calls while reading? Is there an example of how to use these methods?

Praveen Sripati · Accepted Answer · 2011-11-29T12:38:30

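SequenceFile.Reader#seek positions the reader at the given point in the SequenceFile. That point must be a record boundary, otherwise the next read fails. To start from an arbitrary offset, use SequenceFile.Reader#sync(long), which advances to the next sync point after that offset.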

According to Hadoop: The Definitive Guide:

A sync point is a point in the stream that can be used to resynchronize with a record boundary if the reader is “lost”—for example, after seeking to an arbitrary position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts a special entry to mark the sync point every few records as a sequence file is being written. Such entries are small enough to incur only a modest storage overhead—less than 1%. Sync points always align with record boundaries.

SequenceFile.Reader#syncSeen reports whether the previous read passed a sync mark. Neither call is required for a plain sequential read from the start of the file.
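
To make the mechanism concrete, here is a small self-contained sketch of how sync points let a reader recover a record boundary. It is a toy model in plain Java, not Hadoop's on-disk format or classes: SyncPointDemo, its Reader, the marker bytes and the record layout are all assumptions made for illustration. The method names are chosen to mirror SequenceFile.Reader's seek(long), sync(long), syncSeen() and getPosition(), so the calling pattern should carry over.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy model only: this is NOT Hadoop's SequenceFile format or API.
class SyncPointDemo {
    // Hypothetical 16-byte marker and escape value, chosen for this sketch.
    static final byte[] MARKER = "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII);
    static final int ESCAPE = -1;

    private static void putInt(ByteArrayOutputStream out, int value) {
        out.write(ByteBuffer.allocate(4).putInt(value).array(), 0, 4);
    }

    /** Writes length-prefixed records, adding a sync entry after every syncEvery records. */
    static byte[] write(String[] records, int syncEvery) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.length; i++) {
            if (i > 0 && i % syncEvery == 0) {
                putInt(out, ESCAPE);                 // escape value in place of a length
                out.write(MARKER, 0, MARKER.length); // followed by the marker itself
            }
            byte[] payload = records[i].getBytes(StandardCharsets.UTF_8);
            putInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    static class Reader {
        private final ByteBuffer buf;
        private boolean syncSeen;

        Reader(byte[] data) { buf = ByteBuffer.wrap(data); }

        long getPosition() { return buf.position(); }

        /** Jumps to a position that the caller knows is a record boundary. */
        void seek(long position) { buf.position((int) position); }

        /** True if the previous call to next() passed a sync entry. */
        boolean syncSeen() { return syncSeen; }

        /** Moves to the first sync entry at or after position, or to the end if there is none. */
        void sync(long position) {
            byte[] data = buf.array();
            for (int p = (int) position; p + 4 + MARKER.length <= data.length; p++) {
                byte[] candidate = Arrays.copyOfRange(data, p + 4, p + 4 + MARKER.length);
                if (buf.getInt(p) == ESCAPE && Arrays.equals(candidate, MARKER)) {
                    buf.position(p);
                    return;
                }
            }
            buf.position(data.length);
        }

        /** Returns the next record, or null at the end of the data. */
        String next() {
            syncSeen = false;
            if (!buf.hasRemaining()) return null;
            int length = buf.getInt();
            if (length == ESCAPE) {                  // sync entry: skip the marker, note it
                buf.position(buf.position() + MARKER.length);
                syncSeen = true;
                length = buf.getInt();
            }
            byte[] payload = new byte[length];
            buf.get(payload);
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        String[] records = new String[10];
        for (int i = 0; i < records.length; i++) records[i] = "rec-" + i;
        Reader reader = new Reader(write(records, 3));

        reader.seek(9);                              // 9 is a record boundary in this layout
        System.out.println(reader.next());           // rec-1

        reader.sync(10);                             // 10 is mid-record: skip to the next sync point
        System.out.println(reader.getPosition());    // 27
        System.out.println(reader.next() + " syncSeen=" + reader.syncSeen()); // rec-3 syncSeen=true
        System.out.println(reader.next() + " syncSeen=" + reader.syncSeen()); // rec-4 syncSeen=false
    }
}
```

In this sketch, seek is only safe with an offset already known to be a record boundary (for example, one saved earlier from getPosition()), while sync accepts any offset and costs a scan forward to the next marker. That is the same trade-off the quoted passage describes.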
