
I have the following relation in a Pig script:

my_relation: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())

There can be any number of value/timestamp pairs in the bytearray column (even zero).

I would like to transform this relation into this (one row for each entityId, attributeName, value, timestamp quartet):

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,

Alternatively, this would be fine too, since I am not interested in rows that have no value/timestamp pairs:

++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000

Any ideas? Basically I want to normalize the tuple of maps in the bytearray column so that the schema is like this:

my_relation: {entityId: chararray,
              attributeName: chararray, 
              value: float, 
              timestamp: int}

I am a Pig beginner, so sorry if this is obvious! Do I need a UDF to do this?

This question is similar but has no answers so far: How do I split in Pig a tuple of many maps into different rows

I am running Apache Pig version 0.12.0-cdh5.1.2

EDIT - adding details of what I've done so far.

Here's a pig script snippet, with output below:

-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF; both are written in Java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);

c_no_flatten = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  TOBAG($2 ..);

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));

d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (int)$2#'timestamp' as timestamp;

dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;

Output follows. Notice how in the relation 'c', the second value/timestamp pair [value#52.0,timestamp#1388683516000] is lost.

(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}

(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}

(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}

1 Answer

This should do the trick. First, flatten the tuple of maps to get rid of the encapsulating tuple:

b = foreach a generate entityId, attributeName, FLATTEN($2);
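
As a rough mental model of this step (plain Python, not Pig; `flatten_tuple` is a hypothetical helper introduced only for illustration), flattening a tuple splices its fields into the enclosing row. A row holding two maps becomes a four-field row, and a row with an empty tuple keeps only its two leading fields:

```python
def flatten_tuple(row, index):
    """Splice the tuple stored at row[index] into the row itself."""
    inner = row[index]
    return row[:index] + tuple(inner) + row[index + 1:]


# Sample row modeled on the question's data: a tuple of two maps in column 2.
row = ("++JRGOCZQD", "timeseries",
       ({"value": 50.0, "timestamp": 1388592317000},
        {"value": 25.0, "timestamp": 1388682237000}))

flat = flatten_tuple(row, 2)                      # 4 fields: id, name, map, map
empty = flatten_tuple(("++GCYI1OO4", "timeseries", ()), 2)  # 2 fields remain
```

The number of fields after this step therefore varies from row to row, which is why the next step has to collect "everything from the third field onward".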

Now we can convert everything but the first two fields into a bag. The bag can be flattened (see http://pig.apache.org/docs/r0.12.0/basic.html#flatten) to get rows for each value/timestamp pair:

c = foreach b generate
  $0 as entityId,
  $1 as attributeName,
  FLATTEN(TOBAG($2 ..));
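
The intended effect of TOBAG followed by FLATTEN can be sketched in plain Python. This is a simplified model, not Pig code: `to_bag` and `flatten_bag` are hypothetical names, bags are modeled as lists, and Pig's null handling is ignored.

```python
def to_bag(*fields):
    """Model of TOBAG: collect the given fields into a bag (a list here)."""
    return list(fields)


def flatten_bag(prefix, bag):
    """Model of FLATTEN on a bag: one output row per bag element."""
    return [prefix + (item,) for item in bag]


# A row as it looks after the tuple has been flattened: two maps as fields 2 and 3.
row = ("++JRGOCZQD", "timeseries",
       {"value": 50.0, "timestamp": 1388592317000},
       {"value": 25.0, "timestamp": 1388682237000})

# Equivalent in spirit to FLATTEN(TOBAG($2 ..)): keep $0 and $1, fan out the rest.
rows = flatten_bag(row[:2], to_bag(*row[2:]))
```

In this model each map ends up on its own row next to the same entityId and attributeName, and an empty bag yields no rows at all.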

Lastly, get the values you need out of the map:

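-- Epoch-millisecond timestamps such as 1388675231000 exceed the 32-bit int range,
-- so the timestamp is cast to long here rather than int.
d = foreach c generate
  entityId,
  attributeName,
  (float)$2#'value' as value,
  (long)$2#'timestamp' as timestamp;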
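
To check the final output against what is expected, the whole transformation can be modeled outside Pig. The following is a minimal Python sketch under stated assumptions: `normalize` is a hypothetical function, and each input row is represented as (entityId, attributeName, list of maps), mirroring the sample data in the question. Python integers are unbounded, so the millisecond timestamps fit without any special handling.

```python
def normalize(rows):
    """Emit one (entityId, attributeName, value, timestamp) row per map.

    Rows whose list of maps is empty produce no output, matching the
    second (filtered) variant of the desired result in the question.
    """
    out = []
    for entity_id, attribute_name, maps in rows:
        for m in maps:
            out.append((entity_id, attribute_name,
                        float(m["value"]), int(m["timestamp"])))
    return out


# Sample input taken from the question.
my_relation = [
    ("++JIYMIS2D", "timeseries", [{"value": 50.0, "timestamp": 1388675231000}]),
    ("++JRGOCZQD", "timeseries", [{"value": 50.0, "timestamp": 1388592317000},
                                  {"value": 25.0, "timestamp": 1388682237000}]),
    ("++GCYI1OO4", "timeseries", []),
    ("++JYY0LOTU", "timeseries", []),
]

result = normalize(my_relation)
for r in result:
    print(r)
```

If the Pig script returns fewer rows than this model for the same input (for example, only the first pair for ++JRGOCZQD), the loss is happening in one of the FLATTEN/TOBAG steps rather than in the final map lookups.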

Update: Some other options to make a bag of maps out of the tuple of maps: