2
votes

The title of the question says it all. I have a requirement to get the row keys corresponding to the top X (say, top 10) values of a certain column. Thus I need to sort HBase rows by the desired column's values. I don't understand how to do this, or even whether it is doable at all. It seems that HBase does not cater to this well and does not offer any such functionality out of the box.

Q1. Can I use the hbase-spark connector to load the whole HBase table into a Spark RDD and then sort it there? Would this be fast? How would the connector and Spark handle it? Would they fetch all the data onto a single node, or onto multiple nodes and sort in a distributed manner?

Q2. Is there any better way to do this?

Q3. Is this simply not doable in HBase, meaning I should opt for a different framework/technology altogether?

2
Fundamentally no, and that's just an aspect of how HBase stores data. If you want this to be fast, store your data in a columnar format like Parquet. HBase is heavily optimised for random access: choose the data store for your use case. – user1310957

2 Answers

2
votes

A3. If you need to sort your data by some column (not the row key), you get no benefit from using HBase. It will be the same as reading raw files from Hive/HDFS and sorting them, only slower.

A1. Sure, you can use SHC or any other Spark-HBase library for that matter, but A3 still holds: it will load the entire dataset from the region servers into a Spark RDD, only to shuffle it across your entire cluster.
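For illustration, a minimal PySpark sketch of this approach using SHC. The table name, column family, and column names here are assumptions, not from your schema; the format string and catalog structure follow SHC's documented usage. Note this still reads and shuffles the whole table just to pick ten rows:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-topX").getOrCreate()

# Hypothetical schema: rowkey plus a numeric column "score" in family "cf".
catalog = json.dumps({
    "table": {"namespace": "default", "name": "mytable"},
    "rowkey": "key",
    "columns": {
        "rowkey": {"cf": "rowkey", "col": "key", "type": "string"},
        "score":  {"cf": "cf",     "col": "score", "type": "bigint"},
    },
})

df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())

# Full scan of every region, then a cluster-wide shuffle for the sort.
top10 = df.orderBy(df["score"].desc()).limit(10).select("rowkey")
top10.show()
```

The sort itself is distributed, but the dominant cost is scanning the entire table out of HBase first.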

A2. As with any other programming/architecture issue, there are many possible solutions, depending on your resources and requirements.


Will Spark load all the data onto a single node and sort it there, or will it perform the sorting on different nodes?

It depends on two factors:

  • How many regions your table has: this determines the degree of parallelism (number of partitions) when reading from your table.
  • The spark.sql.shuffle.partitions configuration value: once the data is loaded, this determines the degree of parallelism for the sorting stage.
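As a small illustrative fragment, the second factor can be tuned on the session before triggering the sort (the property name is a real Spark SQL setting; the value and column name are just examples):

```python
# Raise the sort-stage parallelism; Spark's default is 200 partitions.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# The subsequent orderBy now shuffles into 400 sort tasks.
top10 = df.orderBy(df["score"].desc()).limit(10)
```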

is there any better [library] than the SHC?

As of today there are multiple libraries for integrating Spark with HBase; each has its own pros and cons, and IMO none of them is fully mature or gives full coverage (compared to the Spark-Hive integration, for example). To get the best out of Spark over HBase, you should have a very good understanding of your use case and select the most suitable library.

0
votes

Q2. Also is there any better way to do this?

If redesigning your HBase table is an option, making this specific column value part of the rowkey would allow fast access to those values, since HBase is optimised for rowkey filters rather than column filters.

You could then create a rowkey that is a concatenation of the existing rowkey and this column's value. Querying it with a RowFilter would then perform much better.
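A sketch of the key-design idea, in plain Python so it is easy to test. Since HBase sorts rowkeys lexicographically (ascending), one common variant of this approach stores the value *inverted* and zero-padded at the front of the key, so a plain scan from the start of the table returns the highest values first. The field names and key layout here are illustrative assumptions:

```python
def encode_rowkey(value: int, original_key: str, width: int = 10) -> str:
    # Invert the value (max - value) and zero-pad to a fixed width so that
    # lexicographic rowkey order equals descending numeric order.
    max_value = 10 ** width - 1
    inverted = max_value - value
    return f"{inverted:0{width}d}|{original_key}"

# Hypothetical rows: original rowkey -> column value to rank by.
rows = {"r1": 42, "r2": 7, "r3": 1000, "r4": 999}

# Sorting the encoded keys mimics the order an HBase scan would return.
scan_order = sorted(encode_rowkey(v, k) for k, v in rows.items())
top2 = [k.split("|")[1] for k in scan_order[:2]]
# top2 is ["r3", "r4"]: the two highest values, without a column filter.
```

With this layout, "top 10" becomes a scan with limit 10 instead of a full-table sort; the trade-off is that you must rewrite the key on every update to the value.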