- I am attempting to load over 10 billion records into HBase, growing on average at 10 million records per day, and then to run a full table scan over the records. I understand that a full scan over HDFS will be faster than one over HBase.
- HBase is being used to order the disparate data on HDFS. The application is built using Spark.
- The data is bulk-loaded into HBase. Because of Spark's various 2 GB limits, the region size was reduced to 1.2 GB from an initial test of 3 GB (this still requires more detailed investigation).
- Scan caching is set to 1000 and block caching is off.
- The total HBase size is in the 6 TB range, yielding several thousand regions across 5 region servers (nodes); the recommendation is in the low hundreds.
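For concreteness, the back-of-the-envelope region math for the sizes quoted above (6 TB total, 1.2 GB regions, 5 region servers) works out as follows:

```python
# Rough region-count arithmetic for the sizes quoted above.
# All figures come from the post; nothing here queries a live cluster.
total_tb = 6
region_gb = 1.2
servers = 5

total_gb = total_tb * 1024
regions = total_gb / region_gb          # regions in total
regions_per_server = regions / servers  # regions per region server

print(round(regions))             # 5120
print(round(regions_per_server))  # 1024

# Each region becomes one Spark partition, so a cached partition holds
# roughly region_gb of data: 1.2 GB fits under Spark's 2 GB block limit,
# while the initial 3 GB regions would not.
```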
- The Spark job essentially iterates over every row and then computes a value from columns within a range.
- Using spark-on-hbase, which internally uses TableInputFormat, the job ran in about 7.5 hours.
- In order to bypass the region servers, I created a snapshot and used TableSnapshotInputFormat instead. That job completed in about 5.5 hours.
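The two timings above work out to roughly a quarter less wall-clock time from bypassing the region servers:

```python
# Relative improvement from switching TableInputFormat ->
# TableSnapshotInputFormat, using the timings quoted above.
baseline_h = 7.5  # TableInputFormat run
snapshot_h = 5.5  # TableSnapshotInputFormat run

improvement_pct = (baseline_h - snapshot_h) / baseline_h * 100
print(round(improvement_pct, 1))  # 26.7
```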
Questions
When reading from HBase into Spark, the regions seem to dictate the Spark partitions, and thus the 2 GB limit, hence the problems with caching. Does this imply that the region size needs to be small?
TableSnapshotInputFormat, which bypasses the region servers and reads directly from the snapshots, also creates its splits by region, so it would still run into the region-size problem above. It is possible to read key-values from HFiles directly, in which case the split size is determined by the HDFS block size. Is there an implementation of a scanner or other utility that can read a row directly from an HFile (to be specific, from a snapshot-referenced HFile)?
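To illustrate why HDFS-block-granularity splits would sidestep the 2 GB limit (assuming a 128 MB HDFS block size, which is a common default but not stated in the post):

```python
# Split sizes: one split per region vs one split per HDFS block.
# Region size and total size are from the post; the 128 MB block size
# is an assumed default.
region_mb = 1.2 * 1024      # 1228.8 MB per region => one 1.2 GB Spark partition
block_mb = 128              # assumed HDFS block size
total_mb = 6 * 1024 * 1024  # 6 TB

splits_per_region = region_mb / block_mb  # block-sized splits per region
total_block_splits = total_mb / block_mb  # splits across the whole table

print(round(splits_per_region, 1))  # 9.6
print(int(total_block_splits))      # 49152

# Each 128 MB split stays far below Spark's 2 GB partition limit,
# regardless of how large the regions themselves grow.
```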
Are there any other pointers, e.g. configuration settings, that may help boost performance? For instance, the HDFS block size? The main use case is a full table scan for the most part.
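For reference, the knobs discussed in this post map onto configuration properties like the following. This is a sketch with illustrative values, not tuned recommendations; the first two properties belong in hbase-site.xml and the last in hdfs-site.xml:

```xml
<!-- hbase-site.xml: cap region size (bytes; 1.2 GB shown, per the test above) -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1288490188</value>
</property>
<!-- hbase-site.xml: default scanner caching, matching the scan cache of 1000 -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>1000</value>
</property>
<!-- hdfs-site.xml: HDFS block size (bytes; 256 MB shown as an example) -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>
```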