We have an HBase table with one column family and 1.5 billion records in it.

The HBase row count was retrieved in the HBase shell with the command

count '<tablename>', {CACHE => 1000000}

The HBase-to-Hive mapping was created with the statement below.

CREATE EXTERNAL TABLE stagingdata(
  rowkey STRING,
  col1 STRING,
  col2 STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,n:col1,n:col2'
)
TBLPROPERTIES ('hbase.table.name' = 'hbase_staging_data');

But when we retrieve the row count in Hive with the query below,

select count(*) from stagingdata;

it shows only 140 million rows in the Hive-mapped table.

We tried the same approach on a smaller HBase table with 100 million records, and all of its records showed up in the Hive-mapped table.

My question is: why are the full 1.5 billion records not showing up in Hive?

Are we missing anything here?

An immediate answer would be highly appreciated. Thanks, Madhu.

1 Answer

What you see in Hive is only the latest version of each key, not all the versions of that key.

The documentation states: "there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp."

Hive HBase Integration
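
To make the "latest version per key" point concrete, here is a minimal Python sketch of a versioned key-value store. It is only a toy model, not HBase or Hive code: the `table` structure, the `put`, `total_cell_versions` and `latest_rows` helpers, and the sample data are all made up for illustration, and it ignores HBase's per-column-family version limits.

```python
from collections import defaultdict

# Toy model of a versioned table (hypothetical, not the HBase API):
# row key -> column -> list of (timestamp, value) versions.
table = defaultdict(lambda: defaultdict(list))

def put(rowkey, column, timestamp, value):
    """Store one more version of a cell under an explicit timestamp."""
    table[rowkey][column].append((timestamp, value))

def total_cell_versions():
    """Count every stored version of every cell."""
    return sum(len(versions)
               for columns in table.values()
               for versions in columns.values())

def latest_rows():
    """Return one row per key, keeping only the newest version of each column."""
    return {
        rowkey: {column: max(versions, key=lambda v: v[0])[1]
                 for column, versions in columns.items()}
        for rowkey, columns in table.items()
    }

put("row1", "n:col1", 1, "a")
put("row1", "n:col1", 2, "b")  # newer version of the same cell
put("row1", "n:col2", 1, "x")
put("row2", "n:col1", 1, "c")

print(total_cell_versions())            # 4 stored cell versions
print(len(latest_rows()))               # 2 rows, one per key
print(latest_rows()["row1"]["n:col1"])  # b, the value with the latest timestamp
```

In this model a query that sees only the latest timestamp returns one row per key, so it can never report more rows than there are distinct keys, however many versions are stored.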