Sort Order in HBase with Pig/Piglatin in Java

Question

I created an HBase table in the shell and added some data. http://hbase.apache.org/book/dm.sort.html states that the data is sorted first by row key and then by column. So I tried this in the HBase shell:

hbase(main):013:0> put 'mytable', 'key1', 'cf:c', 'val'
0 row(s) in 0.0110 seconds

hbase(main):011:0> put 'mytable', 'key1', 'cf:d', 'val'
0 row(s) in 0.0060 seconds

hbase(main):012:0> put 'mytable', 'key1', 'cf:a', 'val'
0 row(s) in 0.0060 seconds


hbase(main):014:0> get 'mytable', 'key1'
COLUMN                CELL                                                      
 cf:a                 timestamp=1376468325426, value=val                        
 cf:c                 timestamp=1376468328318, value=val                        
 cf:d                 timestamp=1376468321642, value=val                        
3 row(s) in 0.0570 seconds

Everything looks fine. I got the order a -> c -> d, as expected.

Then I tried the same with Apache Pig in Java:

pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
printAlias("mytable_data"); // own helper function that iterates over the keys

I got this result:

(key1,[c#val,d#val,a#val])

So now the order is c -> d -> a. That seems a little odd to me; shouldn't it be the same as in HBase? Getting the right order matters to me because I transform the map into a bag afterwards and then join it with other tables. If both inputs are sorted, I could use a merge join without sorting the two datasets first. So does anyone know how to get a sorted map (or bag) of the columns?

Whoops, that was a little unclear. So in the output you want, the map values need to be sorted alphabetically? Why not just sort the values in the UDF? — mr2ert
My idea is to join the output and to speed up the join. So I tried a merge join and wondered why the output is not sorted. Sure, I can sort the output myself, but that costs time. If there is a way to get the data already sorted, it would be faster. — t k

TC1 · Accepted Answer · 2013-10-22T15:28:31

You're fundamentally misunderstanding something: the HBaseStorage backend loads each row as a single Tuple. You've told Pig to load the column family cf as a map[], which is exactly what Pig is doing. A Pig map under the hood is just a java.util.HashMap, which makes no guarantee about iteration order.
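
To illustrate the point about ordering, here is a minimal plain-Java sketch (no Pig involved). The class name MapOrderDemo and the helper sortedKeys are made up for this example. It assumes the column qualifiers are plain ASCII strings, so that Java's natural String order matches the byte order HBase uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Pig or HBase.
public class MapOrderDemo {

    // Copies the map into a TreeMap, which keeps its keys in ascending order,
    // and returns those keys as a list.
    static List<String> sortedKeys(Map<String, Object> columns) {
        return new ArrayList<>(new TreeMap<>(columns).keySet());
    }

    public static void main(String[] args) {
        Map<String, Object> columns = new HashMap<>();
        columns.put("c", "val");
        columns.put("d", "val");
        columns.put("a", "val");

        // HashMap iteration order is unspecified: it may look sorted by
        // coincidence, but nothing guarantees it.
        System.out.println(columns.keySet());

        // The TreeMap copy is always sorted by key.
        System.out.println(sortedKeys(columns)); // prints [a, c, d]
    }
}
```

So whatever order HBase delivers the cells in, it is lost as soon as they are put into a hash-based map.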

Pig currently has no built-in way to convert the map to a bag, but a UDF for that is trivial to write. Leaving out the null checks and other boilerplate, the body looks something like this:

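// bagFactory and tupleFactory are assumed to be fields of the UDF class,
// obtained via BagFactory.getInstance() and TupleFactory.getInstance().
public DataBag exec(Tuple input) throws IOException {
    DataBag resultBag = bagFactory.newDefaultBag();
    // The first field of the input tuple is the column family map.
    @SuppressWarnings("unchecked")
    Map<String, Object> map = (Map<String, Object>) input.get(0);
    // Emit one (key, value) tuple per map entry.
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        Tuple t = tupleFactory.newTuple();
        t.append(entry.getKey());
        t.append(entry.getValue().toString());
        resultBag.add(t);
    }
    return resultBag;
}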

With that you can generate a bag{(k:chararray, v:chararray)}, use FLATTEN to get a list of (k:chararray, v:chararray) tuples, and ORDER those by k.
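
If you would rather do the sorting on the Java side, as suggested in the comments, the same map-to-pairs-then-sort idea can be sketched in plain Java. This is only a stand-in for the Pig pipeline, not Pig code: the class MapToSortedPairs and the method toSortedPairs are hypothetical names, and Map.Entry is used in place of a Pig Tuple.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig: turns a map into (k, v) pairs sorted by k.
public class MapToSortedPairs {

    static List<Map.Entry<String, String>> toSortedPairs(Map<String, Object> columns) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        // One (key, value) pair per map entry, like the tuples in the bag.
        for (Map.Entry<String, Object> entry : columns.entrySet()) {
            Object value = entry.getValue();
            String text = (value == null) ? null : value.toString();
            pairs.add(new SimpleImmutableEntry<>(entry.getKey(), text));
        }
        // Sort by key, which is what ORDER ... BY k does on the flattened tuples.
        pairs.sort(Map.Entry.comparingByKey());
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Object> columns = new HashMap<>();
        columns.put("c", "val");
        columns.put("d", "val");
        columns.put("a", "val");
        System.out.println(toSortedPairs(columns)); // prints [a=val, c=val, d=val]
    }
}
```

Note that this only orders the columns within a single row. Whether a merge join can use the result still depends on Pig knowing that the whole relation is sorted on the join key.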

As for whether there is a way to get the data already sorted: generally no. If the number of fields in the column family is not constant, or the fields are not always the same or not always defined, your only options are

transforming the map to a bag of tuples and sorting that,
or writing a custom LoadFunc which takes a table and a column family and emits one tuple per KeyValue pair scanned. HBase will ensure the ordering and give you the data in the sorted order you see in the shell, but note that the order is only guaranteed at load time. Any further transformation you apply can destroy it.
