How could I display the SUM of the sizes for the same Classification and Type in the following example please?

((classification,Secret),(type,Document.Office),{((size,557856))})
((classification,Secret),(type,Blog.ExternalPost),{((size,4478993))})
((classification,Secret),(type,Social.Post.Twitter),{((size,1902045))})
((classification,Secret),(type,Social.Post.Facebook),{((size,2085060)),((size,557856)),((size,1555956))})
((classification,External),(type,Blog.ExternalPost),{((size,1902045))})
((classification,External),(type,Blog.InternalPost),{((size,1438853))})
((classification,External),(type,Social.Post.Facebook),{((size,1234311)),((size,4260972))})

This is the output of DESCRIBE for the above relation in Pig:

{classification: (name: chararray,value: chararray),type: (name: chararray,value: chararray),{(size: (name: chararray,value: int))}}

I've tried the following but with no luck:

sum = foreach groupedfinal generate $0, $1, SUM($2);

Error: Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast

Your help would be greatly appreciated.

Thanks Mskh

1 Answer

You have a couple of problems here. First, the error message: it indicates that Pig cannot tell which variant of SUM to use (one for integers, one for floats, and so on). The input to SUM should be a bag in which each tuple contains a single number to be summed. That is not the case here, because each tuple in your bag contains another tuple.

This brings us to the second problem: your data organization. Semantically, you really only have three fields here: classification, type, and a bag of sizes. But you are storing these three fields wrapped in tuples, with the name of the field duplicated as a chararray in the first element of each tuple. This wastes space and makes your data much harder to process.

You can project out an individual element of a bag's tuples, like $2.size, to get a bag of just those elements. But in your case this doesn't change anything, because each size in your bag is not a number but another (name, value) tuple, and a bag projection gives you no way to reach into that nested tuple's elements.

You could get around this by FLATTENing the bag, then FLATTENing the tuples, and then re-GROUPing, but I think the best solution is to look further upstream and restructure your data so that you don't have this kind of nesting or these redundant name fields.
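
The FLATTEN, FLATTEN, re-GROUP workaround could look roughly like the following. Treat it as an untested sketch rather than a definitive script: only the relation name groupedfinal comes from the question, while the aliases flat_bag, flat, regrouped and totals and all the field names are made up for illustration. The types in the AS clauses are assumed from the DESCRIBE output in the question.

```pig
-- Untested sketch. Every name except groupedfinal is illustrative.

-- 1. Flatten the bag: one row per inner tuple. Each row still holds
--    three (name, value) tuples.
flat_bag = FOREACH groupedfinal GENERATE $0, $1, FLATTEN($2);

-- 2. Flatten the three tuples so the values become plain fields.
flat = FOREACH flat_bag GENERATE
           FLATTEN($0) AS (cls_label:chararray, cls:chararray),
           FLATTEN($1) AS (typ_label:chararray, typ:chararray),
           FLATTEN($2) AS (size_label:chararray, size_value:int);

-- 3. Re-group on the two values and sum the sizes, which are now
--    plain ints inside the grouped bag, so SUM can pick its int variant.
regrouped = GROUP flat BY (cls, typ);
totals = FOREACH regrouped GENERATE
             FLATTEN(group) AS (cls, typ),
             SUM(flat.size_value) AS total_size;

DUMP totals;
```

As a sanity check against the sample data, the three Secret / Social.Post.Facebook sizes (2085060, 557856 and 1555956) add up to 4198872, so that is the total to expect for that group.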