I have the following relation in a Pig script:
my_relation: {entityId: chararray,attributeName: chararray,bytearray}
There can be any number of value/timestamp pairs in the bytearray column (even zero).
I would like to transform this relation into this (one row for each entityId, attributeName, value, timestamp quartet):
Alternatively, this would be fine too, since I am not interested in rows that have no value/timestamp pairs.
Any ideas? Basically I want to normalize the tuple of maps in the bytearray column so that the schema is like this:
my_relation: {entityId: chararray,
attributeName: chararray,
value: float,
timestamp: int}
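To make the target concrete, here is a minimal sketch of the intended normalization in plain Python (not Pig). The function name `normalize`, the entity and attribute names, and the first value/timestamp pair are hypothetical and only for illustration; the 52.0 / 1388683516000 pair is the one mentioned in the output further down. It assumes each row carries a list of maps with `value` and `timestamp` keys.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# One input row: (entityId, attributeName, list of {"value": ..., "timestamp": ...} maps)
Row = Tuple[str, str, List[Dict[str, Any]]]


def normalize(rows: Iterable[Row]) -> Iterator[Tuple[str, str, float, int]]:
    """Yield one (entityId, attributeName, value, timestamp) row per pair.

    Rows whose list of pairs is empty produce no output at all.
    """
    for entity_id, attribute_name, pairs in rows:
        for pair in pairs:
            yield (entity_id, attribute_name,
                   float(pair["value"]), int(pair["timestamp"]))


# Hypothetical sample data; only the second pair appears in the question.
sample = [
    ("entity-1", "speed", [
        {"value": 50.0, "timestamp": 1388683515000},
        {"value": 52.0, "timestamp": 1388683516000},
    ]),
    ("entity-2", "speed", []),  # no pairs, so this row is dropped
]

result = list(normalize(sample))
for row in result:
    print(row)
```

The nested loop is the whole point: every pair becomes its own row, and an entity with zero pairs simply disappears, which matches the second (acceptable) form of the result described above.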
I am a Pig beginner, so sorry if this is obvious! Do I need a UDF to do this?
This question is similar but has no answers so far: How do I split in Pig a tuple of many maps into different rows
I am running Apache Pig version 0.12.0-cdh5.1.2
EDIT - adding details of what I've done so far.
Here's a pig script snippet, with output below:
-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF; both are written in Java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);
c_no_flatten = foreach b generate
    $0 as entityId,
    $1 as attributeName,
    TOBAG($2 ..);
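-- Same projection as c_no_flatten, but with the bag flattened into rows.
c = foreach b generate
    $0 as entityId,
    $1 as attributeName,
    FLATTEN(TOBAG($2 ..));
-- Pull the value and timestamp out of each map and cast them.
d = foreach c generate
    $0 as entityId,
    $1 as attributeName,
    (float)$2#'value' as value,
    (int)$2#'timestamp' as timestamp;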
dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;
Output follows. Notice how in the relation 'c', the second value/timestamp pair [value#52.0,timestamp#1388683516000] is lost.
a: {entityId: chararray,attributeName: chararray,bytearray}
b: {entityId: chararray,attributeName: chararray,bytearray}
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}
c: {entityId: chararray,attributeName: chararray,bytearray}
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}
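A side note on the `timestamp: int` column, separate from the flattening problem: Pig's `int` is a signed 32-bit integer, and the timestamp shown in the output (1388683516000, which looks like epoch milliseconds) is outside that range, so `long` may be the safer type. A quick check in plain Python, where `fits_in_int32` and `fits_in_int64` are names chosen only for illustration:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def fits_in_int32(n: int) -> bool:
    """True if n is representable as a signed 32-bit integer (Pig int)."""
    return INT32_MIN <= n <= INT32_MAX


def fits_in_int64(n: int) -> bool:
    """True if n is representable as a signed 64-bit integer (Pig long)."""
    return INT64_MIN <= n <= INT64_MAX


ts = 1388683516000  # timestamp taken from the output above

print(fits_in_int32(ts))  # False
print(fits_in_int64(ts))  # True
```

What the `(int)` cast actually produces for such a value is not shown here; the sketch only demonstrates that the number cannot be stored in 32 bits.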