2 votes

I have a requirement to load zip files from an HDFS directory, unzip them, and write the unzipped files back to a single HDFS directory. The files inside are XML and their sizes run into GBs.

First, I approached this by implementing a MapReduce program with a custom InputFormat and a custom RecordReader that unzip the files and provide their contents to the mappers; each mapper then processes the content and writes it out to HDFS using MultipleOutputs. The MapReduce job runs on YARN.
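For reference, the core of the RecordReader is roughly along these lines (a trimmed sketch, shown in Scala for brevity; class and variable names are illustrative, not my exact code):

    import java.io.ByteArrayOutputStream
    import java.util.zip.ZipInputStream
    import org.apache.hadoop.io.{BytesWritable, IOUtils, Text}
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.FileSplit

    class ZipEntryRecordReader extends RecordReader[Text, BytesWritable] {
      private var zis: ZipInputStream = _
      private val key = new Text()
      private val value = new BytesWritable()

      override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
        val path = split.asInstanceOf[FileSplit].getPath
        val fs = path.getFileSystem(context.getConfiguration)
        zis = new ZipInputStream(fs.open(path))
      }

      // Each call hands one zip entry (file name -> full contents) to the mapper.
      override def nextKeyValue(): Boolean = {
        var entry = zis.getNextEntry
        while (entry != null && entry.isDirectory) entry = zis.getNextEntry
        if (entry == null) {
          false
        } else {
          key.set(entry.getName)
          // Buffers the entire uncompressed entry in the mapper's heap; with
          // multi-GB XML entries this is what pushes the container past its limit.
          val buffer = new ByteArrayOutputStream()
          IOUtils.copyBytes(zis, buffer, 4096, false)
          val bytes = buffer.toByteArray
          value.set(bytes, 0, bytes.length)
          true
        }
      }

      override def getCurrentKey(): Text = key
      override def getCurrentValue(): BytesWritable = value
      override def getProgress(): Float = 0f
      override def close(): Unit = if (zis != null) zis.close()
    }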

This approach works fine and I am able to get the unzipped files in HDFS when the input size is in MBs, but when the input size is in GBs, the job fails to write and ends with the following error.

17/06/16 03:49:44 INFO mapreduce.Job:  map 94% reduce 0%
17/06/16 03:49:53 INFO mapreduce.Job:  map 100% reduce 0%
17/06/16 03:51:03 INFO mapreduce.Job: Task Id : attempt_1497463655394_61930_m_000001_2, Status : FAILED
Container [pid=28993,containerID=container_e50_1497463655394_61930_01_000048] is running beyond physical memory limits. Current usage: 2.6 GB of 2.5 GB physical memory used; 5.6 GB of 12.5 GB virtual memory used. Killing container.

It is apparent that each unzipped file is processed by one mapper, and the YARN child container running that mapper is not able to hold the large file in memory.

On the other hand, I would like to try Spark (also running on YARN) to unzip the files and write the unzipped files to a single HDFS directory, but I wonder whether in Spark, too, each executor has to process a single file.
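Roughly, what I have in mind is something along these lines (only a rough sketch; the paths are placeholders, I have not verified it end to end, and it assumes entry names are unique across all the zips):

    import java.net.URI
    import java.util.zip.ZipInputStream
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.IOUtils
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("UnzipToHdfs").getOrCreate()
    val sc = spark.sparkContext

    val zipDir = "hdfs:///data/zips"      // placeholder input directory
    val outDir = "hdfs:///data/unzipped"  // placeholder output directory

    // Each zip file becomes one (path, PortableDataStream) pair, so one task
    // still ends up handling one whole zip file.
    sc.binaryFiles(zipDir + "/*.zip").foreach { case (_, stream) =>
      val fs = FileSystem.get(URI.create(outDir), new Configuration())
      val zis = new ZipInputStream(stream.open())
      var entry = zis.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // Write each XML entry as its own file into the single output directory.
          val out = fs.create(new Path(outDir, new Path(entry.getName).getName))
          IOUtils.copyBytes(zis, out, 64 * 1024, false) // copy in 64 KB chunks
          out.close()
        }
        entry = zis.getNextEntry
      }
      zis.close()
    }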

I'm looking for a solution that processes the files in parallel but, in the end, writes them to a single directory.

Please let me know whether this is possible in Spark, and share some code snippets.

Any help appreciated.

1
Is this potentially a duplicate of stackoverflow.com/questions/38101857/… ? - Rick Moritz

1 Answer

2 votes

Actually, the task itself is not failing! YARN is killing the container (inside which the map task is running) because that YARN child is using more memory than it requested from YARN. Rather than moving this to Spark, you can simply increase the memory given to the MapReduce tasks.


I would recommend that you:

  • Increase the YARN child memory, since you are handling GBs of data. Some key properties (a per-job sketch follows this list):

    • yarn.nodemanager.resource.memory-mb => total memory the NodeManager can allocate to containers on a node
    • yarn.scheduler.maximum-allocation-mb => maximum memory a single container can be allocated
    • mapreduce.map.memory.mb => map task (container) memory; must be less than yarn.scheduler.maximum-allocation-mb at any point at runtime
  • Focus this job on the data processing (unzipping) only, and invoke another job/command to merge the files afterwards.
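For the memory piece, a minimal sketch of setting the per-job values in the driver (the numbers are examples only; size them for your cluster and keep mapreduce.map.memory.mb below yarn.scheduler.maximum-allocation-mb):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val conf = new Configuration()
    // Example values only; tune them for your cluster.
    conf.set("mapreduce.map.memory.mb", "8192")       // container size YARN allocates per map task
    conf.set("mapreduce.map.java.opts", "-Xmx6553m")  // JVM heap, kept below the container size
    val job = Job.getInstance(conf, "unzip-xml")
    // configure the InputFormat, mapper, output path, etc. as before, then submit the job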