2
votes

I wrote a simple hash join program in Hadoop MapReduce. The idea is the following:

A small table is distributed to every mapper using the DistributedCache provided by the Hadoop framework. The large table is distributed over the mappers with a split size of 64M. The setup code of the mapper builds a hash map by reading every line of this small table. In the mapper code, every key is looked up (get) in the hash map, and if the key exists in the hash map it is written out. There is no need for a reducer at this point. This is the code we use:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, Text> {
        // In-memory copy of the small table, keyed by the join column.
        private HashMap<String, String> joinData = new HashMap<String, String>();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            String textvalue = value.toString();
            String[] tokens = textvalue.split(",");
            if (tokens.length == 2) {
                // Probe the hash map and emit only keys that exist in the small table.
                String joinValue = joinData.get(tokens[0]);
                if (null != joinValue) {
                    context.write(new Text(tokens[0]), new Text(tokens[1] + ","
                            + joinValue));
                }
            }
        }

        @Override
        public void setup(Context context) {
            try {
                // The small table is shipped to every node via the DistributedCache.
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                        .getConfiguration());
                if (null != cacheFiles && cacheFiles.length > 0) {
                    String line;
                    String[] tokens;
                    BufferedReader br = new BufferedReader(new FileReader(
                            cacheFiles[0].toString()));
                    try {
                        while ((line = br.readLine()) != null) {
                            tokens = line.split(",");
                            if (tokens.length == 2) {
                                joinData.put(tokens[0], tokens[1]);
                            }
                        }
                    } finally {
                        br.close();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

While testing this code, the small table was 32M and the large table was 128M, running on one master and two slave nodes.

This code fails with these inputs when I give it a 256M heap (I set -Xmx256m through mapred.child.java.opts in the mapred-site.xml file). When I increase it to 300M it proceeds very slowly, and with 512M it reaches its maximum throughput.
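
For completeness, the same limit can also be set per job from the driver instead of cluster-wide. A minimal sketch using the Hadoop 1.x property name (the job name is just illustrative):

    // Sketch: per-job override of the child JVM heap, equivalent to the
    // cluster-wide mapred.child.java.opts entry in mapred-site.xml.
    // (Uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.mapreduce.Job.)
    Configuration conf = new Configuration();
    conf.set("mapred.child.java.opts", "-Xmx256m");
    Job job = new Job(conf, "hash join");   // job name is illustrative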

I don't understand where my mapper is consuming so much memory. With the inputs given above and with the mapper code, I don't expect the heap usage to ever reach 256M, yet it fails with a Java heap space error.

I would be thankful for any insight into why the mapper is consuming so much memory.

EDIT: here is the job client output:

13/03/11 09:37:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/11 09:37:33 INFO input.FileInputFormat: Total input paths to process : 1
13/03/11 09:37:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/11 09:37:33 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/11 09:37:34 INFO mapred.JobClient: Running job: job_201303110921_0004
13/03/11 09:37:35 INFO mapred.JobClient:  map 0% reduce 0%
13/03/11 09:39:12 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:40:43 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_0, Status : FAILED
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: File /usr/home/hadoop/hadoop-1.0.3/libexec/../logs/userlogs/job_201303110921_0004/attempt_201303110921_0004_m_000001_0/log.tmp already exists
    at org.apache.hadoop.io.SecureIOUtils.insecureCreateForWrite(SecureIOUtils.java:130)
    at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:157)
    at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

attempt_201303110921_0004_m_000001_0: Exception in thread "Thread for syncLogs" java.lang.OutOfMemoryError: Java heap space
attempt_201303110921_0004_m_000001_0:   at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
attempt_201303110921_0004_m_000001_0:   at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:59)
attempt_201303110921_0004_m_000001_0:   at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
attempt_201303110921_0004_m_000001_0:   at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
attempt_201303110921_0004_m_000001_0:   at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
attempt_201303110921_0004_m_000001_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201303110921_0004_m_000001_0: log4j:WARN Please initialize the log4j system properly.
13/03/11 09:42:18 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_1, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:43:48 INFO mapred.JobClient: Task Id : attempt_201303110921_0004_m_000001_2, Status : FAILED
Error: GC overhead limit exceeded
13/03/11 09:45:09 INFO mapred.JobClient: Job complete: job_201303110921_0004
13/03/11 09:45:09 INFO mapred.JobClient: Counters: 7
13/03/11 09:45:09 INFO mapred.JobClient:   Job Counters 
13/03/11 09:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=468506
13/03/11 09:45:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/11 09:45:09 INFO mapred.JobClient:     Launched map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient:     Data-local map tasks=6
13/03/11 09:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/03/11 09:45:09 INFO mapred.JobClient:     Failed map tasks=1

1 Answer

4
votes

It's hard to say for sure where the memory consumption is going, but here are a few pointers:

  • You're creating 2 Text objects for every line of your input. You should instead use 2 Text objects initialized once in your Mapper as class variables, and for each line just call text.set(...). This is a common Map/Reduce usage pattern and can save quite a bit of memory overhead (see the sketch after this list).
  • You should consider using the SequenceFile format for your input, which would avoid the need to parse the lines with textvalue.split; the key and value would be directly available to you. I've read several times that doing string splits like this can be quite intensive, so you should avoid it as much as possible if memory is really an issue. You can also think about using KeyValueTextInputFormat if, as in your example, you only care about key/value pairs.
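
Here is a minimal sketch of the Text-reuse pattern from the first point (the class and field names are illustrative, not taken from your code, and the setup/DistributedCache part is omitted):

    import java.io.IOException;
    import java.util.HashMap;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final HashMap<String, String> joinData = new HashMap<String, String>();

        // Allocated once and reused for every input line instead of creating
        // two new Text objects per call to map().
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(",");
            if (tokens.length == 2) {
                String joinValue = joinData.get(tokens[0]);
                if (joinValue != null) {
                    outKey.set(tokens[0]);
                    outValue.set(tokens[1] + "," + joinValue);
                    context.write(outKey, outValue);
                }
            }
        }
    }

With KeyValueTextInputFormat the split on "," would go away for the large table as well, since the framework hands the mapper the key and value already separated (the separator is configurable).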

If that isn't enough, I would advise looking at this link, especially part 7, which gives you a very simple method to profile your application and see what gets allocated where.
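
In case that link goes stale: Hadoop 1.x also has built-in task profiling hooks you can turn on from the driver. A rough sketch using the standard mapred properties (the task range chosen here is arbitrary):

    // Sketch: enable HPROF profiling for a couple of map task attempts so you can
    // see where the heap is going; the dumps end up in the task attempts' log dirs.
    // (Uses org.apache.hadoop.conf.Configuration.)
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.task.profile", true);
    conf.set("mapred.task.profile.maps", "0-1");   // profile only the first two map tasks
    // The default mapred.task.profile.params already uses hprof with heap=sites,
    // which reports allocations per call site.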