Redundant? Not really.
Let me shed some light. In distributed systems, efficient serialization/deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing: interprocess communication between nodes, and persistent storage of data.
In Hadoop specifically, interprocess communication between nodes is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream back into the original message. So it is very important to have a solid serialization/deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format be:
- Compact
- Fast
- Extensible
- Interoperable
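To make the round trip concrete, here is a minimal sketch of serializing a `Writable` to a binary stream and reading it back, assuming `hadoop-common` (which provides `org.apache.hadoop.io.IntWritable`) is on the classpath; the class name is mine, not part of any Hadoop API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize: render the value into a binary stream,
        // the way RPC renders a message before sending it.
        IntWritable out = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        out.write(new DataOutputStream(bytes));

        // An IntWritable serializes to exactly 4 bytes -- no header, no type name.
        System.out.println("serialized length: " + bytes.size()); // prints 4

        // Deserialize: rebuild the original value from the binary stream.
        IntWritable in = new IntWritable();
        in.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println("round-tripped value: " + in.get()); // prints 163
    }
}
```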
Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
A few points to remember before concluding that dedicated MapReduce types are redundant:
- Hadoop’s Writable-based serialization framework provides a more efficient and customizable serialization and representation of data for MapReduce programs than Java’s general-purpose native serialization framework.
- Unlike Java’s serialization, Hadoop’s Writable framework does not write the type name with each object; it expects all clients of the serialized data to already know the types being used. Omitting the type names makes serialization faster and yields compact, randomly accessible serialized data that non-Java clients can easily interpret.
- Hadoop’s Writable-based serialization can also reduce object-creation overhead by reusing Writable objects, which is not possible with Java’s native serialization framework (see the sketch below).
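A small sketch illustrating the last two points: comparing the serialized size of one integer under both frameworks, and reusing a single Writable instance across reads. Again, this assumes `hadoop-common` on the classpath, and the class name is mine; the ~81-byte figure for Java serialization is what a typical JDK produces for a boxed `Integer`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.IntWritable;

public class WritableVsJavaSerialization {
    public static void main(String[] args) throws IOException {
        // Java serialization writes a stream header and the class name
        // alongside the value, so one Integer costs dozens of bytes.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Integer.valueOf(163));
        }
        System.out.println("Java serialization: " + javaBytes.size() + " bytes"); // ~81

        // A Writable writes only the 4-byte payload.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new IntWritable(163).write(new DataOutputStream(writableBytes));
        System.out.println("Writable: " + writableBytes.size() + " bytes"); // 4

        // Object reuse: one Writable instance absorbs record after record
        // via readFields(), instead of allocating a new object per record.
        IntWritable reused = new IntWritable();
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(writableBytes.toByteArray()));
        reused.readFields(in); // same instance, new value -- no per-record allocation
        System.out.println("reused holds: " + reused.get());
    }
}
```

This reuse pattern is exactly what MapReduce exploits: the framework hands your mapper or reducer the same key and value objects repeatedly, refilling them with `readFields()` on each call.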
HTH