I'm new to Hadoop. Recently I've been trying to process (read only) many small files on HDFS/Hadoop. The average file size is about 1 KB and there are more than 10M files. The program must be written in C++ due to some constraints.
This is just a performance evaluation, so I'm only using 5 machines as data nodes. Each data node has 5 data disks.
I wrote a small C++ program that reads the files directly from the local disks (not from HDFS) to establish a performance baseline. The program creates 4 reading threads per disk. The result is about 14 MB/s per disk, so the total throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.
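For reference, each baseline reader thread does roughly the following (a minimal sketch; the per-disk path list and the way it is split across threads are simplified placeholders):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <vector>

static std::atomic<long long> g_bytes_read{0};

// Each thread sequentially reads its own slice of one disk's file list.
void read_worker(const std::vector<std::string>& paths,
                 size_t begin, size_t end) {
    std::vector<char> buf(4096);                  // files average ~1 KB, 4 KB is enough
    for (size_t i = begin; i < end; ++i) {
        std::ifstream in(paths[i], std::ios::binary);
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
            g_bytes_read += in.gcount();
    }
}

int main() {
    // Placeholder: in the real test this list comes from scanning one disk.
    std::vector<std::string> paths = {/* ... paths on this disk ... */};

    const int kThreadsPerDisk = 4;
    std::vector<std::thread> threads;
    size_t chunk = (paths.size() + kThreadsPerDisk - 1) / kThreadsPerDisk;
    for (int t = 0; t < kThreadsPerDisk; ++t) {
        size_t b = t * chunk;
        size_t e = std::min(paths.size(), b + chunk);
        threads.emplace_back(read_worker, std::cref(paths), b, e);
    }
    for (auto& th : threads) th.join();
    std::printf("read %lld bytes\n", (long long)g_bytes_read);
    return 0;
}
```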
However, when the same program (still C++, dynamically linked against libhdfs.so, creating 4 * 5 * 5 = 100 threads) reads the files from the HDFS cluster, the throughput is only about 55 MB/s.
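The per-thread HDFS read loop looks roughly like this (a minimal sketch; "namenode-host", the port, and the path list are placeholders, and error handling is mostly omitted):

```cpp
#include <cstdio>
#include <fcntl.h>   // O_RDONLY
#include <string>
#include <vector>
#include "hdfs.h"    // from $HADOOP_HOME/include, linked against libhdfs.so

// Reads every file in `paths` through one HDFS connection and returns the
// total number of bytes read. Each of the 100 reader threads runs this.
long long read_files(hdfsFS fs, const std::vector<std::string>& paths) {
    long long total = 0;
    std::vector<char> buf(4096);               // files average ~1 KB
    for (const auto& p : paths) {
        hdfsFile f = hdfsOpenFile(fs, p.c_str(), O_RDONLY, 0, 0, 0);
        if (!f) continue;                      // skip unreadable files in this sketch
        tSize n;
        while ((n = hdfsRead(fs, f, buf.data(),
                             static_cast<tSize>(buf.size()))) > 0)
            total += n;
        hdfsCloseFile(fs, f);
    }
    return total;
}

int main() {
    // Placeholders for the actual NameNode host and port.
    hdfsFS fs = hdfsConnect("namenode-host", 8020);
    if (!fs) { std::fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    std::vector<std::string> paths = {/* ... HDFS paths for this thread ... */};
    std::printf("read %lld bytes\n", read_files(fs, paths));

    hdfsDisconnect(fs);
    return 0;
}
```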
If the program is launched through MapReduce (Hadoop Streaming, 5 jobs, each with 20 threads, so the total is still 100 threads), the throughput drops to about 45 MB/s. (I guess it's slowed down by some bookkeeping overhead.)
I'm wondering what performance HDFS can reasonably provide. As you can see, compared with the native-disk baseline, the throughput is only about 1/7. Is this a problem with my configuration? An HDFS limitation? A Java limitation? What's the best approach for my scenario? Would SequenceFiles help (much)? What throughput can we reasonably expect compared to native I/O reads?
Here's some of my configuration:
NameNode heap size: 32 GB.
Job/Task node heap size: 8 GB.
NameNode handler count: 128
DataNode handler count: 8
DataNode maximum number of transfer threads: 4096
1 Gbps Ethernet.
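For completeness, this is roughly how those settings map onto hdfs-site.xml (assuming Hadoop 2.x property names; I believe older releases call the last one dfs.datanode.max.xcievers):

```xml
<!-- hdfs-site.xml excerpt -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>128</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```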
Thanks.