
I'm trying to calculate percentiles using Pig. I need to group the data by an attribute and calculate a percentile for each tuple in the group, based on sales.

As far as I can tell, there is no built-in Pig function for this. Has anyone faced a similar problem who can help?

1 Answer

As JaiPrakash mentioned, you can use the StreamingQuantile UDF from the Apache DataFu library. I already have an example ready, so I'll copy it here. It asks for the 0.0, 0.5 and 1.0 quantiles (minimum, median and maximum) of value for each item.

Input

item1,234
item1,324
item1,769
item2,23
item2,23
item2,45

Pig Script

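-- Register DataFu and bind StreamingQuantile to the 0.0, 0.5 and 1.0 quantiles.
REGISTER datafu-1.2.0.jar;
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0');
-- Load the comma-separated (item, value) rows.
data = LOAD 'data' USING PigStorage(',') AS (item:chararray, value:int);
-- Group by item and compute the quantiles of value within each group.
quantiles = FOREACH (GROUP data BY item) GENERATE group, Quantile(data.value);
DUMP quantiles;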

Output

(item1,(234.0,324.0,769.0))
(item2,(23.0,23.0,45.0))
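
If you want to sanity-check those numbers outside of Pig, here is a minimal plain-Python sketch that recomputes the minimum, median and maximum per item for the same six rows. It is only a cross-check, not how DataFu works internally: StreamingQuantile is an approximate estimator, and the helper name min_median_max_by_item is made up for this illustration.

```python
from collections import defaultdict
from statistics import median

# Same rows as the Input section above: (item, value).
rows = [
    ("item1", 234), ("item1", 324), ("item1", 769),
    ("item2", 23), ("item2", 23), ("item2", 45),
]


def min_median_max_by_item(rows):
    """Hypothetical helper: group values by item, return (min, median, max) as floats."""
    groups = defaultdict(list)
    for item, value in rows:
        groups[item].append(value)
    return {
        item: (float(min(vals)), float(median(vals)), float(max(vals)))
        for item, vals in groups.items()
    }


result = min_median_max_by_item(rows)
for item, quantiles in sorted(result.items()):
    print(item, quantiles)
```

On this tiny input the exact values agree with the Pig output above. On large groups, or groups with an even number of rows (where statistics.median averages the two middle values), I would expect small differences from what StreamingQuantile reports.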