
I have an HBase table that contains contact information for customers. The table has around 700k rows. I have a script that queries this customers table to find matches for 2,000-3,000 records. Each scan takes around 1 s to complete, so 2,000 records take roughly 33 minutes. I wanted to see if I can improve this performance. I have tried setting scan caching, but it didn't help. Here are the details: the customers table has only one column family, and the customer ID is the row key. My query looks like this:

```
SingleColumnValueFilter('internal', 'country', =, 'binary:GB') AND SingleColumnValueFilter('internal', 'postcode', =, 'binary:W24RT') AND SingleColumnValueFilter('internal', 'street', =, 'binary:bayswaterroad')
```
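
For context, here is roughly what that scan looks like in the Java client (a simplified sketch; the connection setup and the exact caching value are assumptions, not my real code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // All three filters must pass (MUST_PASS_ALL == AND).
            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            filters.addFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("internal"), Bytes.toBytes("country"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("GB")));
            filters.addFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("internal"), Bytes.toBytes("postcode"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("W24RT")));
            filters.addFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("internal"), Bytes.toBytes("street"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("bayswaterroad")));

            Scan scan = new Scan();
            scan.setFilter(filters);
            scan.setCaching(1000); // rows fetched per RPC; this is the caching I tried

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}
```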

How can I improve the performance?

Have you considered using a relational database (e.g. MySQL) to store your data? It will be much more efficient for this task and your data size. – kostya

1 Answer


HBase performs best when you design your row key around your query pattern: a lookup that can be driven by the row key (or a row-key prefix) completes in minimal time, while anything else degenerates into a full-table scan. So one option is to redesign your row key, as sketched below.
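
For example, since your lookups are always by country + postcode + street, a composite row key turns each match into a short prefix scan instead of a 700k-row scan. A minimal sketch, assuming a hypothetical second table `customers_by_address` keyed on `country|postcode|street|customerId` that you populate alongside the main table:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AddressLookup {
    // Hypothetical index table keyed on country|postcode|street|customerId.
    static final TableName INDEX = TableName.valueOf("customers_by_address");

    public static void lookup(Connection connection,
                              String country, String postcode, String street) throws Exception {
        try (Table index = connection.getTable(INDEX)) {
            Scan scan = new Scan();
            // Seeks directly to the matching key range instead of scanning every row.
            scan.setRowPrefixFilter(Bytes.toBytes(country + "|" + postcode + "|" + street + "|"));
            try (ResultScanner scanner = index.getScanner(scan)) {
                for (Result r : scanner) {
                    // The customer id is the last segment of the row key.
                    String rowKey = Bytes.toString(r.getRow());
                    System.out.println(rowKey.substring(rowKey.lastIndexOf('|') + 1));
                }
            }
        }
    }
}
```

The trade-off is that you write (and keep consistent) a second copy of the key data, but each of your 2,000-3,000 lookups becomes a seek rather than a full-table scan.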

Also, you are combining three SingleColumnValueFilters, so every row touched by the scan has to be read and compared against three column values.

You can also set options on the filters to exclude rows that don't contain the column at all.
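
Specifically, `SingleColumnValueFilter.setFilterIfMissing(true)` excludes rows that lack the column entirely; by default such rows pass the filter and are returned. A sketch using the country filter from your query:

```java
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Filters {
    static SingleColumnValueFilter countryFilter() {
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("internal"), Bytes.toBytes("country"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("GB"));
        // By default, rows that lack the 'country' column pass this filter
        // and are returned; this excludes them instead.
        filter.setFilterIfMissing(true);
        return filter;
    }
}
```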