I've been stuck on this question for a while. I have a data file that looks like this:
2012/01/01 Name1 "Category1,Category2,Category3"
2012/01/01 Name2 "Category2,Category3"
2012/01/01 Name3 "Category1,Category5"
Each item is associated with a comma-separated list of categories. I would like to be able to group by category name, to get output like this:
Category1 Name1, Name3
Category2 Name1, Name2
...
Category5 Name3
(even more specifically, I don't need the names of the items - just the counts of number of items in that category would do)
I ended up writing a UDF to take the comma-separated category field, and convert it to a Pig bag. My data schema is now something like this:
{date: chararray, name: chararray, categories: {t: (category:chararray)}}
I am stuck on the next step - actually performing a grouping by nested bag value. I have tried variations of nested FOREACH statement without any luck. For example:
x = FOREACH myData
{
categoryNames = FOREACH categories GENERATE category;
GENERATE myData.Name, categoryNames;
}
My thought was that this kind of syntax could generate tuples of (Name, category), which I can run a GROUP over. However, the actual output is the whole bag, taking me back to square 1. I am out of ideas on how to proceed - help/feedback would be most appreciated. Thanks!