I created an HBase table in the shell and added some data. According to http://hbase.apache.org/book/dm.sort.html, the cells are sorted first by row key and then by column. So I tried the following in the HBase shell:
hbase(main):013:0> put 'mytable', 'key1', 'cf:c', 'val'
0 row(s) in 0.0110 seconds
hbase(main):011:0> put 'mytable', 'key1', 'cf:d', 'val'
0 row(s) in 0.0060 seconds
hbase(main):012:0> put 'mytable', 'key1', 'cf:a', 'val'
0 row(s) in 0.0060 seconds
hbase(main):014:0> get 'mytable', 'key1'
COLUMN CELL
cf:a timestamp=1376468325426, value=val
cf:c timestamp=1376468328318, value=val
cf:d timestamp=1376468321642, value=val
3 row(s) in 0.0570 seconds
Everything looks fine. I got the order a -> c -> d, as expected.
Then I tried the same with Apache Pig from Java:
pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
printAlias("mytable_data"); // my own function, which iterates over the keys
I got this result:
(key1,[c#val,d#val,a#val])
So now the order is c -> d -> a. That seems a little odd to me: shouldn't it be the same as in HBase? Getting the right order matters to me because I afterwards transform the map into a bag and then join it with other tables. If both inputs are already sorted, I could use a merge join without sorting the two datasets first, couldn't I? Does anyone know how to get a sorted map (or bag) of the columns?
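To make clear what ordering I am after, here is a minimal plain-Java sketch. It assumes the Pig map reaches Java as a java.util.Map<String, Object> with no guaranteed iteration order, which is my guess at why the columns come back shuffled. The class name SortedColumns and the method sortedEntries are made up for illustration and are not part of Pig or HBase. Note that TreeMap orders String keys by character, which matches HBase's byte order only for plain ASCII qualifiers like the ones here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedColumns {

    // Hypothetical helper: returns "qualifier#value" strings ordered by qualifier.
    public static List<String> sortedEntries(Map<String, Object> columns) {
        // Copying into a TreeMap sorts the entries by key (natural String order).
        Map<String, Object> sorted = new TreeMap<>(columns);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Object> e : sorted.entrySet()) {
            out.add(e.getKey() + "#" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Same columns as in the Pig output above, inserted in the order c, d, a.
        Map<String, Object> columns = new LinkedHashMap<>();
        columns.put("c", "val");
        columns.put("d", "val");
        columns.put("a", "val");
        System.out.println(sortedEntries(columns)); // prints [a#val, c#val, d#val]
    }
}
```

This is only the result I want per row; what I am really looking for is a way to get it inside Pig itself, so that the bag built from the map is already sorted.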