0
votes

My Input file is below

a,t1,1000,100
a,t1,2000,200
b,t2,1000,200
b,t2,5000,100

How to find count of distinct $0 in the above file.

myinput = LOAD 'file' AS(a1:chararray,a2:chararray,amt:int,rate:int);

After the above script what needs to done. Also Can I use that distinct count for dividing some other is a different relation

2

2 Answers

3
votes

First of all, the way you read the data is incorrect. If you try to dump "myinput", youll see that the whole row is read in the first field (a1), while the others are empty. The reason is that you don't specify a LOAD function, and a default function is the PigStorage() built-in function which expects tab-delimited file (so it ignores your commas!).You need to explicitly specify a load function (e.g. PigStorage()) via the using clause and pass it arguments:

myInput = LOAD file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as (amt:int), $3 as (rate:int);

Moving on, to find the DISTINCT $0 first you have to extract field $0 in a separate relation. The reason is that the DISTINCT statement works on entire records, rather than on separate fields.

myField = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT myField;

Now the result of distinctA1 is {(a), (b)}. By using now group all, you will group together all of your records together, and then what is left is to COUNT them:

grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);

And now you're happy. :) The complete code:

myInput = LOAD 'file' using PigStorage(',');
myInput2 = FOREACH myInput GENERATE $0 as (a1:chararray), $1 as (a2:chararray), $2 as    (amt:int), $3 as (rate:int);
a1 = FOREACH myInput2 GENERATE a1;
distinctA1 = DISTINCT a1;
grouped = GROUP distinctA1 all;
countA1 = FOREACH grouped GENERATE COUNT(distinctA1);
0
votes

You can do something like this :

myInput = LOAD 'file.txt' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int);
Data = GROUP myInput BY $0;
Data = FOREACH Data GENERATE $0;
Data = GROUP Data ALL;
Data = FOREACH Data GENERATE $0,COUNT($1);

NB: By Grouping on $0 you are doing the same thing as a distinct and you get better performance ;)