4
votes

How do I merge all files in a directory on HDFS, that I know are all compressed, into a single compressed file, without copying the data through the local machine? For example, but not necessarily, using Pig?

As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file /data/output/foo.gz.


3 Answers

4
votes

I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool for merging files on HDFS using MapReduce. It does exactly what you describe and provides several options to deal with compression and to control the number of output files.

  Crush --max-file-blocks XXX /data/input /data/output

max-file-blocks is the maximum number of dfs blocks per output file. For example, according to the documentation:

With the default value 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files, where each output file is roughly the same size.
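The grouping arithmetic in that quote can be sketched roughly as follows (this is an illustration of the block budget per output file, not FileCrush's actual bin-packing code):

```python
import math

# Assumed parameters, mirroring the documentation example:
# each output file may hold at most max_file_blocks dfs blocks of input,
# and each small file is 1/10th of a dfs block.
max_file_blocks = 8
file_size_in_blocks = 1 / 10

def num_output_files(n_small_files):
    """Minimum number of output files needed under the block budget."""
    total_blocks = n_small_files * file_size_in_blocks
    return math.ceil(total_blocks / max_file_blocks)

assert num_output_files(80) == 1  # 80 * 1/10 = 8 blocks -> fits in one file
assert num_output_files(81) == 2  # 8.1 blocks -> spills into a second file
```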

1
votes

If you set PARALLEL to 1, you will get a single output file. This can be done in two ways:

  1. Add "set default_parallel 1;" to your Pig script, but note that this affects every operation in the script.
  2. Set PARALLEL on a single operation, e.g. DISTINCT ID PARALLEL 1;

You can read more about this in the Pig documentation on Parallel Features.

0
votes

I know there's an option to merge files to the local filesystem using the "hdfs dfs -getmerge" command. Perhaps you can use that to merge the files locally and then use "hdfs dfs -copyFromLocal" to copy the result back into HDFS. Note that this does route the data through the local machine, which the question was trying to avoid.
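One reason any concatenation-based merge (such as getmerge) produces a usable result for .gz parts is that the gzip format permits multiple compressed members to be concatenated into a single valid stream. A small sketch with Python's gzip module illustrates this:

```python
import gzip

# Two independently gzipped parts, mimicking part-m-00000.gz and part-m-00001.gz
part0 = gzip.compress(b"line one\n")
part1 = gzip.compress(b"line two\n")

# Plain byte concatenation yields a valid multi-member gzip stream;
# gzip.decompress handles multi-member data and returns the joined contents.
merged = part0 + part1
assert gzip.decompress(merged) == b"line one\nline two\n"
```

So a merged foo.gz built by concatenating the compressed parts decompresses to the concatenation of their contents, with no recompression needed.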