1
votes

I want to ask why the Hadoop Framework, which implements the MapReduce distributed programming paradigm, uses a Text class to store a String when Java already has Strings implemented for us to use? It seems unnecessarily redundant (lol).

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html

3

3 Answers

4
votes

They have implemented their own class Text for String, LongWritable for Long, IntWritable for Integers.

Purpose behind adding these class is to define their own basic types for optimized network serialization. These are found in the org.apache.hadoop.io package.

This types produces a compact serialized object to makes best use of network bandwidth. And Hadoop is meant to process big data so network bandwidth is the most precious resource they want to use in very effective way. Plus for this class they have reduced the overhead of serialization and deserialization of these object as compared to Java's native types.

1
votes

Redundant???

Let me shed some light. When we talk about distributed systems efficient Serialization/Deserialization plays a vital role. It appears in two quite distinct areas of distributed data processing :

  • IPC
  • Persistent Storage

To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So, it is very important to have a solid Serialization/Deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is:

  • Compact
  • Fast
  • Extensible
  • Interoperable

Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).

Few points to remember before thinking that having dedicated MapReduce types is redundant :

  1. Hadoop’s Writable-based serialization framework provides a more efficient and customized serialization and representation of the data for MapReduce programs than using the general-purpose Java’s native serialization framework.
  2. As opposed to Java’s serialization, Hadoop’s Writable framework does not write the type name with each object expecting all the clients of the serialized data to be aware of the types used in the serialized data. Omitting the type names makes the serialization process faster and results in compact, random accessible serialized data formats that can be easily interpreted by non-Java clients.
  3. Hadoop’s Writable-based serialization also has the ability to reduce the object-creation overhead by reusing the Writable objects, which is not possible with the Java’s native serialization framework.

HTH

0
votes

Why can't I use the basic String or Integer classes?

Integer and String implement the standard Serializable-interface of Java . The problem is that MapReduce serializes/deserializes values not utilizing this standard interface but rather an own interface, which is called Writable.

The key and value classes have to be serializable by the framework and hence need to implement
the Writable interface. Additionally, the key classes have to implement the WritableComparable
interface to facilitate sorting by the framework. 

Here is the link to MapReduce Tutorial