I need to load data from Pig into HBase using HBaseStorage, and I can't figure out how to store a variable number of columns for a specific column family. (With a known number of columns it is straightforward.)
The data looks like this (spaces added for readability):
Id,ItemId,Count,Date
1 ,1 ,2 ,2015-02-01
2 ,2 ,2 ,2015-02-02
3 ,1 ,2 ,2015-02-03
I have an HBase table with a rowkey and one column family called Attributes. I first load the CSV using:
A = LOAD 'items.csv' USING PigStorage(',')
AS (Id, ItemId, Count:chararray, Date:chararray);
Now I want to group the rows by ItemId, so I do the following:
B = FOREACH A GENERATE ItemId, TOTUPLE(Date, Count);
C = GROUP B BY ItemId;
So I get my data nicely grouped, with the key and then the tuples with Date and Count:
1 {(2015-02-03, 2),(2015-02-01, 2)}
2 {(2015-02-02, 2)}
What I am aiming for in HBase is one row per ItemId, with one column per date holding the count:
Rowkey = 1 (Attributes.2015-02-03,2) (Attributes.2015-02-01,2)
Rowkey = 2 (Attributes.2015-02-02,2)
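To make the target layout concrete, here is a minimal sketch in plain Python (not Pig) of the reshaping I am after. It uses the sample rows above; the helper name group_to_columns is hypothetical and only illustrates the idea of turning each ItemId's (Date, Count) pairs into a map keyed by date, one entry per HBase column qualifier.

```python
from collections import defaultdict

# Sample rows from the CSV above: (Id, ItemId, Count, Date)
rows = [
    ("1", "1", "2", "2015-02-01"),
    ("2", "2", "2", "2015-02-02"),
    ("3", "1", "2", "2015-02-03"),
]


def group_to_columns(rows):
    """Hypothetical helper: map each ItemId (rowkey) to {Date: Count}.

    Each inner dict stands for the columns of one HBase row in the
    Attributes family, so its size varies from row to row.
    """
    grouped = defaultdict(dict)
    for _id, item_id, count, date in rows:
        grouped[item_id][date] = count
    return dict(grouped)


result = group_to_columns(rows)
for rowkey, columns in sorted(result.items()):
    print(rowkey, columns)
```

Row 1 ends up with two entries and row 2 with one, which is the variable column count I cannot express in the STORE statement.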
This is the part I am struggling with: how do I declare that I have a variable number of columns? I have tried the following, as well as multiple other combinations:
STORE onlygroups into 'hbase://mytable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');
But I get several errors, for example this one:
ERROR 2999: Unexpected internal error. org.apache.pig.data.InternalCachedBag
cannot be cast to java.util.Map
I have also tried using TOMAP, but that does not work either. Any suggestions?
Note: the solution suggested as a duplicate does not solve my issue; it basically recommends using MapReduce, and my data structure is different.