I need to load data from Pig into HBase using HBaseStorage, and I can't figure out how to write a variable number of columns for a specific column family. (With a known number of columns it is straightforward.)

My data looks like this (spaces added for readability):

Id,ItemId,Count,Date
1 ,1     ,2    ,2015-02-01
2 ,2     ,2    ,2015-02-02
3 ,1     ,2    ,2015-02-03

I have an HBase table with a rowkey and one column family called Attributes. I first load the CSV using:

A = LOAD 'items.csv' USING PigStorage(',') 
as (Id,ItemId,Count:chararray, CreationDate:chararray);

And now I want to group them by ItemId so I do the following:

B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count);

C = GROUP B BY ItemId;

So I get my data nicely grouped, with the key and then the tuples with Date and Count:

1   {(2015-02-03, 2),(2015-02-01, 2)}
2   {(2015-02-02, 2)}

What I am aiming for in HBase is one row per ItemId, with one column per date holding the count:

Rowkey = 1 (Attributes.2015-02-03,2) (Attributes.2015-02-01,2)
Rowkey = 2 (Attributes.2015-02-02,2)

This is the part I am struggling with: how do I declare that I have a variable number of columns? I have tried the following, as well as multiple other combinations:

STORE onlygroups into 'hbase://mytable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');

But I get several errors, for example this one:

ERROR 2999: Unexpected internal error. org.apache.pig.data.InternalCachedBag 
    cannot be cast to java.util.Map

I have also tried using TOMAP, but that does not work either. Any suggestions?

Note: the solution in the question identified as a duplicate does not solve my issue; it basically recommends using MapReduce, and my data structure is different.

Possible duplicate of "Apache Pig: Dynamic columns" - Rahul Sharma
Thanks @RahulSharma, but I already tried it and it did not work. Also, that one says to try MapReduce, which does not really solve it with Pig. - xmorera
HBaseStorage can add columns dynamically, but here each record has an id and a bag of tuples, which is causing the error. In this case you have to write a UDF to explode the Pig bag into individual tuples and then try again. - Rahul Sharma

1 Answer

In order to load data into HBase, your data in Pig should be in the following format:

tuple(key, map(col_qual, col_value))

In your case:

(1,[2015-02-03#2])
(1,[2015-02-01#2])
(2,[2015-02-02#2])

You can build this structure directly from your initial data:

A = LOAD 'items.csv' USING PigStorage(',') as (Id,ItemId,Count:chararray,CreationDate:chararray);
storeHbase = FOREACH A GENERATE ItemId, TOMAP(CreationDate, Count);

Or, if you want to do it after grouping by key:

B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count) as pair;
C = GROUP B BY ItemId;
storeHbase = FOREACH C {
    Tmp = FOREACH $1 GENERATE TOMAP(pair.CreationDate,pair.Count);
    GENERATE group, FLATTEN(Tmp);
};

Finally, you can store your data in HBase:

STORE storeHbase into 'hbase://mytable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');

where mytable is your HBase table and Attributes is your column family.
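
For reference, here is a minimal end-to-end sketch of the ungrouped variant with a schema check before the write. It assumes that items.csv has no header row and no padding spaces, that mytable and the Attributes column family already exist in HBase, and that typing every field as chararray is acceptable; the alias cols is just an illustrative name. I have not run it against your data, so treat it as a starting point rather than a tested script.

```pig
-- Sketch only. Assumptions: items.csv has no header line, and the HBase
-- table 'mytable' with column family 'Attributes' already exists.
A = LOAD 'items.csv' USING PigStorage(',')
    AS (Id:chararray, ItemId:chararray, Count:chararray, CreationDate:chararray);

-- One (rowkey, map) tuple per input line. The map key becomes the column
-- qualifier and the map value becomes the cell value.
storeHbase = FOREACH A GENERATE ItemId, TOMAP(CreationDate, Count) AS cols;

-- Check the shape before writing: the second field must be a map, not a bag.
DESCRIBE storeHbase;
DUMP storeHbase;

-- The first field is used as the rowkey; 'Attributes:*' writes every map
-- entry as a column in the Attributes family.
STORE storeHbase INTO 'hbase://mytable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');
```

Note that the GROUP step is not strictly required for the layout you want: HBase writes to the same rowkey with different column qualifiers end up in the same row, so the two tuples for ItemId 1 should land in one row with two columns.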