I am facing two issues:
Report Files
I'm generating a Pig report whose output goes into several files: part-r-00000, part-r-00001, ... (They all result from the same relation; multiple mappers are producing the data, so there are multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO $output USING PigStorage(',');
I'd like all of these to end up in one report, so before storing the result using HBaseStorage I sort it with PARALLEL 1: report = ORDER report BY col1 PARALLEL 1. In other words, I am forcing the number of reducers to 1 and therefore generating a single file, as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO $output USING PigStorage(',');
Is there a better way of generating a single file output?
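For comparison, the only other approach I can think of is to leave the job parallel and concatenate the part files afterwards. Below is a minimal Python sketch of that idea, not something taken from my Pig job. It assumes the part-r-* files have already been copied to a local directory, and the function name merge_part_files and its parameters are hypothetical names chosen for illustration.

```python
import glob
import os
import shutil


def merge_part_files(part_dir, merged_path):
    """Concatenate part-r-* files from part_dir into merged_path, in name order.

    Assumes the part files are local and each one ends with a newline.
    Returns the list of part files that were merged.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-r-*")))
    with open(merged_path, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                # Stream the bytes so large part files are not loaded into memory.
                shutil.copyfileobj(part, merged)
    return part_paths
```

This only glues the files together in file-name order; it does not sort rows or combine duplicate keys, so it is not equivalent to the ORDER ... PARALLEL 1 approach above.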
Group By
I have several reports that perform group-by:
grouped = GROUP data BY col
Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers to group the result, and when I sum or count the data I get incorrect results. For example, this is what I see:
part-r-00000:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
part-r-00001:
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
part-r-00000:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
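To make the arithmetic explicit: the expected rows are just the per-key totals of the partial rows from the two part files. Here is a small Python sketch of that calculation (not part of the Pig job; merge_partials and the variable names are made up for illustration, and the two numeric columns are treated generically since they are simply the second and third fields of each row).

```python
from collections import defaultdict

# Partial rows copied from the two part files shown above: (key, first, second).
partial_rows = [
    ("grouped_col_val_1", 5, 6),  # part-r-00000
    ("grouped_col_val_2", 1, 1),  # part-r-00000
    ("grouped_col_val_1", 3, 4),  # part-r-00001
    ("grouped_col_val_2", 5, 5),  # part-r-00001
]


def merge_partials(rows):
    """Add up both numeric columns for every key."""
    totals = defaultdict(lambda: [0, 0])
    for key, first, second in rows:
        totals[key][0] += first
        totals[key][1] += second
    return {key: tuple(values) for key, values in totals.items()}


merged = merge_partials(partial_rows)
for key in sorted(merged):
    print(key, *merged[key], sep=", ")
# prints:
# grouped_col_val_1, 8, 10
# grouped_col_val_2, 6, 6
```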
So I end up doing my group as follows:
grouped = GROUP data BY col PARALLEL 1
Then I see the correct result. I have a feeling I'm missing something.
Here is pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val;
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group AS id, SUM(row.val);
STORE report INTO '$outpath' USING PigStorage...