
I am working on a use case and need help improving the scan performance.

Visits to our website are captured as logs, which we process with Apache Pig and insert directly into an HBase table (test) using HBaseStorage. This runs every morning. The data consists of the following columns:

Customerid | Name | visitedurl | timestamp | location | companyname

I have only one column family (test_family).

As of now I generate a random number for each row and insert it as the row key for that table. For example, say I have the following data to be inserted into the table:

1725 | xxx | www.something.com | 127987834 | india | zzzz
1726 | yyy | www.some.com | 128389478 | UK | yyyy

In that case I add 1 as the row key for the first row, 2 for the second, and so on.

Note: the same customer ID is repeated on different days, which is why I chose a generated number as the row key.

When I query data from the table using scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"}, it takes more than 2 minutes to return the results.

Please suggest a way to bring this down to 1 to 2 seconds, since I am using it for real-time analytics.

Thanks

HBase is not designed for this kind of query. Perhaps you could use MySQL instead? – kostya

1 Answer


Based on the query you mentioned, I am assuming you need to look up records by Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.

However, there can be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. The unique number could be the timestamp; it depends on your requirements.
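As a rough illustration, here is a minimal sketch using the HBase Java client of how rows with such a composite key could be written. The table name, column family, and sample values are taken from your question, so treat them as assumptions and adjust to your setup; in your pipeline you would build the same CustomerID|timestamp key inside the Pig script before handing rows to HBaseStorage.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKeyWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test"))) {

            // Row key = CustomerID|timestamp, e.g. "1725|127987834",
            // so all visits of one customer sort next to each other.
            byte[] rowKey = Bytes.toBytes("1725" + "|" + "127987834");
            byte[] cf = Bytes.toBytes("test_family");

            Put put = new Put(rowKey);
            put.addColumn(cf, Bytes.toBytes("Name"), Bytes.toBytes("xxx"));
            put.addColumn(cf, Bytes.toBytes("visitedurl"), Bytes.toBytes("www.something.com"));
            put.addColumn(cf, Bytes.toBytes("location"), Bytes.toBytes("india"));
            put.addColumn(cf, Bytes.toBytes("companyname"), Bytes.toBytes("zzzz"));
            table.put(put);
        }
    }
}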

To scan the data in this case, use a PrefixFilter on the row key. This bounds the read to the matching keys instead of scanning the whole table the way a SingleColumnValueFilter does, which will give you much better performance.
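For example, a prefix scan for customer 1002 could look like the sketch below (again a minimal example with the HBase Java client; the table name and customer ID come from your question). Setting the start row to the prefix lets the scan jump straight to the first matching key, and the PrefixFilter stops it as soon as the keys leave the prefix, so only that customer's rows are read. The same idea works from the HBase shell by putting a PrefixFilter in the FILTER clause instead of the SingleColumnValueFilter you use now.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerPrefixScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test"))) {

            // All rows for customer 1002 have keys of the form "1002|<timestamp>".
            byte[] prefix = Bytes.toBytes("1002|");

            Scan scan = new Scan();
            scan.setStartRow(prefix);                 // jump straight to the first matching key
            scan.setFilter(new PrefixFilter(prefix)); // stop once keys no longer match the prefix

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(Bytes.toString(result.getRow()));
                }
            }
        }
    }
}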

Hope this helps.