I've got a table in Hbase let's say "tbl" and I would like to query it using Hive. Therefore I mapped a table to hive as follows:
CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");
Queries like:
select * from tbl", "select id from tbl", "select id, data
from tbl
are really fast.
But queries like
select id from tbl where substr(id, 0, 5) = "12345"
select id from tbl where data["777"] IS NOT NULL
are incredibly slow.
In the contrary when running from Hbase shell:
"scan 'tbl', {
COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or
"scan 'tbl', { COLUMNS=>'data', "FILTER" =>
FilterList.new([qualifierFilter('777')])}"
it is lightning fast!
When I looked into the mapred job generated by hive on jobtracker I discovered that "map.input.records" counts ALL the items in Hbase table, meaning the job makes a full table scan before it even starts any mappers!! Moreover, I suspect it copies all the data from Hbase table to hdfs to mapper tmp input folder before executuion.
So, my questions are - Why hbase storage handler for hive does not translate hive queries into appropriate hbase functions? Why it scans all the records and then slices them using "where" clause? How can it be improved?
Any suggestions to improve the performance of Hive queries(mapped to HBase Table).
Can we create secondary index on HBase tables?
We are using HBase and Hive integration and trying to tune the performance of Hive queries.