Parameters of some machines are measured and uploaded via a web service to HDFS. The parameter values of each measurement are saved in a separate file, and a measurement has about 1000 values on average.
The problem is that there is a large number of files, and only a subset of them is used for any given MapReduce job (for example, the measurements from the last month). Because of this I can't merge them all into one large sequence file, since different files are needed at different times.
I understand that it is bad to have a large number of small files: the NameNode keeps the paths to all of them on HDFS in its memory, and each small file results in the creation of a separate Mapper.
How can I avoid this problem?
CombineSequenceFileInputFormat? It should combine small files into one split and create a smaller number of mappers. Documentation: hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/… – Aleksei Shestakov
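To illustrate why combining input files reduces the number of mappers, here is a rough, self-contained sketch in plain Java. It is not Hadoop code and not the algorithm Hadoop actually uses (which, as far as I know, also takes node and rack locality into account); it only models the idea of packing small files into splits up to a size cap. The class name SplitPackingSketch, the method packIntoSplits, the 8 KB file size and the 128 MB cap are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of combining small files into larger input splits.
class SplitPackingSketch {

    // Greedily packs files, in the given order, into splits of at most
    // maxSplitBytes. Returns the size of each split; one mapper per split.
    // A single file larger than the cap is kept whole in its own split.
    static List<Long> packIntoSplits(long[] fileSizes, long maxSplitBytes) {
        List<Long> splits = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the cap.
            if (current > 0 && current + size > maxSplitBytes) {
                splits.add(current);
                current = 0;
            }
            current += size;
        }
        if (current > 0) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed workload: 10,000 measurement files of roughly 8 KB each.
        long[] fileSizes = new long[10_000];
        Arrays.fill(fileSizes, 8 * 1024L);
        long maxSplitBytes = 128L * 1024 * 1024; // assumed cap per split

        System.out.println("Mappers, one per file: " + fileSizes.length);
        System.out.println("Mappers, combined splits: "
                + packIntoSplits(fileSizes, maxSplitBytes).size());
    }
}
```

Note that this only addresses the number of mappers. The files themselves still exist individually on HDFS, so the NameNode memory concern from the question is not changed by combining splits at job time.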