Grouping by bag value in Pig

Question

I've been stuck on this question for a while. I have a data file that looks like this:

2012/01/01        Name1      "Category1,Category2,Category3"
2012/01/01        Name2      "Category2,Category3"
2012/01/01        Name3      "Category1,Category5"

Each item is associated with a comma-separated list of categories. I would like to be able to group by category name, to get output like this:

Category1    Name1, Name3
Category2    Name1, Name2
...
Category5    Name3

(even more specifically, I don't need the names of the items - just the counts of number of items in that category would do)

I ended up writing a UDF to take the comma-separated category field, and convert it to a Pig bag. My data schema is now something like this: {date: chararray, name: chararray, categories: {t: (category:chararray)}}

I am stuck on the next step - actually performing a grouping by nested bag value. I have tried variations of nested FOREACH statement without any luck. For example:

x = FOREACH myData
{
categoryNames = FOREACH categories GENERATE category;
GENERATE myData.Name, categoryNames;
}

My thought was that this kind of syntax could generate tuples of (Name, category), which I can run a GROUP over. However, the actual output is the whole bag, taking me back to square 1. I am out of ideas on how to proceed - help/feedback would be most appreciated. Thanks!

Romain Romain · Accepted Answer · 2012-02-04T21:42:40

Assuming each name is unique in your data file, you could FLATTEN the bag of category, then GROUP by category and COUNT the number of names by category.

e.g.

name_category = 
    FOREACH data
    GENERATE
        name,
        FLATTEN(categories) AS category;

category_group =
    GROUP name_category
    BY category;

category_count =
    FOREACH category_group
    GENERATE
        FLATTEN(group) AS category,
        COUNT(name_category) AS count;

Grouping by bag value in Pig

1 Answers