1 vote

Parameters of some machines are measured and uploaded via a web service to HDFS. The values for each measurement are saved in a separate file, and a measurement contains about 1000 values on average.

The problem is that there is a large number of files. Only a certain subset of the files is needed for a given MapReduce job (for example, the measurements from the last month). Because of this I am not able to merge them all into one large sequence file, since different files are needed at different times.

I understand that it is bad to have a large number of small files: the NameNode keeps the paths of all of them in its memory, and, on the other hand, each small file results in its own mapper being created.

How can I avoid this problem?

Did you try using CombineSequenceFileInputFormat? It should combine small files into one split and create a smaller number of mappers. Documentation: hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/… – Aleksei Shestakov
To be honest, I'm not very experienced with Hadoop, but I understand that there may be some problems with that approach. For example, the references to all the files on HDFS would still be kept in the NameNode's memory, is that correct? Are there other problems as well when using CombineSequenceFileInputFormat? – Kobe-Wan Kenobi
Yes, storing a large number of small files in HDFS is a bad idea. You can merge the small files into one sequence file per hour (or per day). If you use each file's timestamp as the key and the file's content as the value, then in the mapper you will be able to filter out the files that are not in the specified time range (see the sketch after these comments). – Aleksei Shestakov
So you are suggesting running a MapReduce job and emitting only the files in the specified range from the mapper? But I would still have the problem of many mapper tasks for that job; I guess such a thing could be tolerated? On the other hand, I would need to keep the original files on HDFS in order to merge them again the next time I need them, which would put load on the NameNode the whole time. Any comment on that? I guess I will have to do something like that if there isn't a better solution. What do you think of using HBase or something similar to query by timestamp? Would the same problems exist? – Kobe-Wan Kenobi
You could try HAR (Hadoop Archive) to pack the small files into a single archive, which reduces the NameNode overhead of maintaining too many small files, and use CombineFileInputFormat over the HAR to limit the number of mappers dispatched. – suresiva
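
Putting the two suggestions from the comments together, here is a minimal sketch (not from the original post) of a job that reads the merged sequence files through CombineSequenceFileInputFormat, so that many small inputs are packed into a few splits, and filters records by a timestamp range inside the mapper. The class names, the key/value types (a LongWritable epoch-millisecond key and a Text value), and the range parameters are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineSequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimeRangeJob {

    // Passes through only records whose timestamp key falls in [start, end).
    public static class TimeRangeMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        private long start;
        private long end;

        @Override
        protected void setup(Context context) {
            start = context.getConfiguration().getLong("range.start", Long.MIN_VALUE);
            end = context.getConfiguration().getLong("range.end", Long.MAX_VALUE);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (key.get() >= start && key.get() < end) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("range.start", Long.parseLong(args[2]));
        conf.setLong("range.end", Long.parseLong(args[3]));

        Job job = Job.getInstance(conf, "time-range-filter");
        job.setJarByClass(TimeRangeJob.class);

        // Pack many small sequence-file inputs into few combined splits.
        job.setInputFormatClass(CombineSequenceFileInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here);
        // this sets mapreduce.input.fileinputformat.split.maxsize.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(TimeRangeMapper.class);
        job.setNumReduceTasks(0);   // map-only filtering job
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```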

2 Answers

2 votes

A late answer: you can use SeaweedFS, https://github.com/chrislusf/seaweedfs (I am working on this). It has special optimizations for a large number of small files.

HDFS actually has good support for delegating file storage to other file systems. Just add the SeaweedFS Hadoop jar. See https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System
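
As a hedged sketch of what that wiring might look like: the fs.seaweedfs.impl property and the seaweed.hdfs.SeaweedFileSystem class name are taken from the linked wiki page and should be verified against the version of the jar you put on the classpath; the filer host, port, and paths below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeaweedFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the seaweedfs:// scheme (class name as documented in the wiki;
        // verify against the jar version you actually use).
        conf.set("fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem");

        // Access files through the SeaweedFS filer instead of the HDFS NameNode.
        FileSystem fs = FileSystem.get(
                new java.net.URI("seaweedfs://filer-host:8888/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/measurements"))) {
            System.out.println(status.getPath());
        }
    }
}
```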

0 votes

You could concatenate the required files into one temporary file that is deleted once it has been analyzed. I think you could create this very easily with a script.
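
For example, here is a minimal sketch of that idea (the class name and paths are placeholders, and the selection of the "required files" is left as a stub): it copies each selected small file into a single temporary SequenceFile on HDFS using the standard SequenceFile.Writer API, with the file name as the key and the file content as the value, so that one job can read it and the file can be deleted afterwards.

```java
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MergeSelectedFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path(args[0]);   // directory of small measurement files
        Path merged = new Path(args[1]);     // temporary merged sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(merged),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            for (FileStatus status : fs.listStatus(inputDir)) {
                // The selection criterion (e.g. "measurements from last month")
                // would go here; this sketch simply takes every file.
                if (status.isFile()) {
                    byte[] content = readFully(fs, status.getPath());
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
        // ... run the MapReduce job against `merged`, then delete it:
        // fs.delete(merged, false);
    }

    private static byte[] readFully(FileSystem fs, Path p) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (FSDataInputStream in = fs.open(p)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        return out.toByteArray();
    }
}
```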

Anyway, do the math: such a big file will also be split into several pieces whose size is the block size (the dfs.blocksize parameter in hdfs-default.xml), and each of these pieces will be assigned to a mapper. In other words, depending on the block size and the average size of a "small file", the gain may not be that great (for instance, if each small file is already a sizable fraction of the block size, merging barely reduces the number of mappers).