Data size increases in hbase

Question

I am trying to import data from MySQL to HBase using Sqoop. There are about 9 million records in the MySQL table, roughly 1.2 GB in size. The replication factor of the Hadoop cluster is three.
Here are the issues I am facing:

1. The data size after import to HBase is more than 20 GB! Ideally it should be close to, say, 5 GB (1.2 GB * 3 + some overhead).
2. The HBase table has VERSIONS defined as 1. If I import the same table again from MySQL, the file size in /hbase/ increases (almost doubles), although the row count in the HBase table remains the same. This seems weird: I am inserting the same rows into HBase, so the file size should stay the same, just like the row count.

As far as I understand, the file size in the second case shouldn't increase when I import the same row set, since at most one version should be kept for each entry.

Any help would be highly appreciated.

Woot4Moo · Accepted Answer · 2013-09-06T11:23:24

It depends. According to this blog:

So to calculate the record size:

Fixed part needed by KeyValue format = Key Length + Value Length + Row Length + CF Length + Timestamp + Key Type = (4 + 4 + 2 + 1 + 8 + 1) = 20 Bytes

Variable part needed by KeyValue format = Row + Column Family + Column Qualifier + Value

Total bytes required = Fixed part + Variable part

So for the above example let's calculate the record size:

First Column = 20 + (4 + 4 + 10 + 3) = 41 Bytes
Second Column = 20 + (4 + 4 + 9 + 3) = 40 Bytes
Third Column = 20 + (4 + 4 + 8 + 6) = 42 Bytes

Total Size for the row1 in above example = 123 Bytes

To Store 1 billion such records the space required = 123 * 1 billion = ~ 123 GB
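
The arithmetic above can be reproduced with a short script. This is only a sketch: the function name `keyvalue_size` is made up for illustration, and the byte lengths in `cells` are assumed to follow the order of the variable part given earlier (row, column family, column qualifier, value). It counts raw KeyValue bytes only; HDFS replication, HFile indexes, and compression are not modeled.

```python
# Fixed per-cell overhead from the KeyValue layout quoted above, in bytes:
# key length (4) + value length (4) + row length (2) + CF length (1)
# + timestamp (8) + key type (1)
FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row_len, family_len, qualifier_len, value_len):
    """Estimate the raw size in bytes of one cell (hypothetical helper)."""
    return FIXED_OVERHEAD + row_len + family_len + qualifier_len + value_len


# Assumed (row, family, qualifier, value) byte lengths for the three columns
# of the worked example.
cells = [(4, 4, 10, 3), (4, 4, 9, 3), (4, 4, 8, 6)]

row_size = sum(keyvalue_size(*cell) for cell in cells)
total_bytes = row_size * 1_000_000_000  # 1 billion such rows

print(row_size)           # 123
print(total_bytes / 1e9)  # 123.0 (decimal GB)
```

Note that the row key and column family bytes are counted once per cell, not once per row, so a table with many columns or long row keys and qualifiers can come out far larger than the same data in MySQL.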

I would presume your calculations are grossly incorrect. Perhaps share your schema design with us and we can work out the math.
