2 votes

I have a lot of small files (~1 MB each) that I need to distribute. It's known that Hadoop and HDFS prefer large files, but I don't know whether this also applies to the DistributedCache, since the distributed files are stored on local machines.

If they need to be merged, what is the best way to merge the files programmatically on HDFS?

One more question: what are the benefits of using a symlink? Thanks.


2 Answers

2 votes

You can create an archive (tar or zip) of all your small files and add it to the distributed cache as follows:

DistributedCache.addCacheArchive(new URI("/myapp/myzip.zip"), job);
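
Regarding the symlink question: the main benefit is that the framework links the unpacked archive into each task's working directory, so the tasks can open it with a simple relative path. A minimal driver-side sketch of that setup (the class name, paths, and the "#myzip.zip" fragment naming the link are placeholders, not from the original answer):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {                        // hypothetical driver class
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(CacheSetup.class);
        // Link cached files/archives into each task's working directory;
        // the "#myzip.zip" URI fragment names that symlink.
        DistributedCache.createSymlink(job);
        DistributedCache.addCacheArchive(new URI("/myapp/myzip.zip#myzip.zip"), job);
        // ... set mapper/reducer, input/output paths, then submit the job
    }
}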

And get the files in your mapper/reducer as follows:

public void configure(JobConf job) {
    // With symlinks enabled, the unpacked archive appears in the task's
    // working directory, so a relative path can be used directly:
    File f = new File("./myzip.zip/some/file/in/zip.txt");
}
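
If you prefer not to rely on the symlink, the local path of the unpacked archive can also be resolved through the cache API. A rough sketch (not from the original answer) using the old mapred API; it assumes imports for org.apache.hadoop.fs.Path and java.io.IOException:

public void configure(JobConf job) {
    try {
        // Each entry is the local directory where one cached archive was unpacked
        Path[] archives = DistributedCache.getLocalCacheArchives(job);
        File f = new File(archives[0].toString(), "some/file/in/zip.txt");
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}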

Read more here
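
As for merging files programmatically on HDFS: if plain concatenation is enough (rather than an archive), one option is FileUtil.copyMerge from the Hadoop FileSystem API. A minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeSmallFiles {                   // hypothetical helper
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate every file under the source directory into one HDFS file;
        // the boolean flag controls whether the sources are deleted afterwards.
        FileUtil.copyMerge(fs, new Path("/myapp/small-files"),
                           fs, new Path("/myapp/merged.txt"),
                           false, conf, null);
    }
}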

2 votes

Here is a blog from Cloudera on the small files problem.
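
A remedy often suggested in such discussions is to pack the small files into a single SequenceFile, with the file name as the key and the contents as the value. A rough, hedged sketch (paths and class name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackIntoSequenceFile {              // hypothetical helper
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/myapp/small-files");   // placeholder source dir
        Path out = new Path("/myapp/packed.seq");   // placeholder output
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            // One record per small file: key = file name, value = raw bytes.
            for (FileStatus status : fs.listStatus(in)) {
                byte[] data = new byte[(int) status.getLen()];
                FSDataInputStream is = fs.open(status.getPath());
                try {
                    is.readFully(data);
                } finally {
                    is.close();
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}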