
I'm running a Java MapReduce job against an HBase cluster.

The rowkeys are of the form UUID-yyyymmdd-UUID, and groups of rows have the first UUID (a rowkey prefix) in common. I'll call these rows with a shared prefix a group.

In our HBase cluster, some groups contain a lot more data than others. The size of a group could be in the low thousands of rows, or it could be more than a million.

As I understand, one region will be read by one mapper.

This means regions that contain larger groups get assigned to a single mapper and therefore this single mapper is left to process a lot of data.

I have read about and tested setting the hbase.hregion.max.filesize parameter lower so that regions get split. This does improve performance of the MapReduce job, since more mappers are marshaled to process the same data.
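For reference, that parameter is set in hbase-site.xml; the value below is just an illustrative example, not a recommendation:

```xml
<!-- hbase-site.xml: lower the global max store file size per region
     so regions split sooner (example value: 1 GB) -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```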

However, setting this global max parameter lower can also lead to hundreds or thousands more regions, which introduces its own overhead and is not advised.

Now to my question:

Instead of applying a global max, is it possible to split regions based on the rowkey prefix? That way, if a large group hits a certain size, it could spill over into another region, while the smaller groups remain within one region each, keeping the overall number of regions as low as possible.

Hope this makes sense! Thanks.


1 Answer


When you create a table in HBase, you can pre-split it any way you want by providing a list of split keys (i.e. region boundaries), so in your case, if you know the "problematic" key prefixes in advance, you can split on those. Here's a simple example in Scala, but it is pretty much the same in Java (except for some more boilerplate code :) )

  private val admin = new HBaseAdmin(config)

  if (!admin.tableExists(tableName)) createTable()

  private def createTable() {
    val htd = new HTableDescriptor(tableName)
    val hcd = new HColumnDescriptor(TableHandler.FAMILY)

    hcd.setMaxVersions(1)
    htd.addFamily(hcd)
    admin.createTable(htd, calcSplits) // <---- create the table with the splits
  }

  // Pre-split into regions, one split key per possible first byte of the rowkey
  private def calcSplits = {
    val splits = new Array[Array[Byte]](256)
    var i = 0
    for (zone <- 0x00 to 0xff) {
      val temp = new Array[Byte](1)
      temp(0) = (0xff & zone).asInstanceOf[Byte]
      splits(i) = temp
      i += 1
    }
    splits
  }

Also, when the table is already created, you can use the same HBaseAdmin's split method to split specific regions.