4
votes

I've been stuck on this question for a while. I have a data file that looks like this:

2012/01/01        Name1      "Category1,Category2,Category3"
2012/01/01        Name2      "Category2,Category3"
2012/01/01        Name3      "Category1,Category5"

Each item is associated with a comma-separated list of categories. I would like to be able to group by category name, to get output like this:

Category1    Name1, Name3
Category2    Name1, Name2
...
Category5    Name3

(even more specifically, I don't need the names of the items - just the counts of number of items in that category would do)

I ended up writing a UDF to take the comma-separated category field, and convert it to a Pig bag. My data schema is now something like this: {date: chararray, name: chararray, categories: {t: (category:chararray)}}

I am stuck on the next step - actually performing a grouping by nested bag value. I have tried variations of nested FOREACH statement without any luck. For example:

x = FOREACH myData
{
categoryNames = FOREACH categories GENERATE category;
GENERATE myData.Name, categoryNames;
}

My thought was that this kind of syntax could generate tuples of (Name, category), which I can run a GROUP over. However, the actual output is the whole bag, taking me back to square 1. I am out of ideas on how to proceed - help/feedback would be most appreciated. Thanks!

1

1 Answers

5
votes

Assuming each name is unique in your data file, you could FLATTEN the bag of category, then GROUP by category and COUNT the number of names by category.

e.g.

name_category = 
    FOREACH data
    GENERATE
        name,
        FLATTEN(categories) AS category;

category_group =
    GROUP name_category
    BY category;

category_count =
    FOREACH category_group
    GENERATE
        FLATTEN(group) AS category,
        COUNT(name_category) AS count;