0
votes

I am facing two issues:

  1. Report Files

    I'm generating a Pig report whose output goes into several files: part-r-00000, part-r-00001, ... (These all come from the same relation; multiple mappers produce the data, hence the multiple files.):

    B = FOREACH A GENERATE col1, col2, col3;
    STORE B INTO '$output' USING PigStorage(',');
    

    I'd like all of these to end up in one report, so before storing the result using HBaseStorage I sort it with report = ORDER report BY col1 PARALLEL 1. In other words, I force the number of reducers to 1 and therefore generate a single file, as follows:

    B = FOREACH A GENERATE col1, col2, col3;
    B = ORDER B BY col1 PARALLEL 1;
    STORE B INTO '$output' USING PigStorage(',');
    

    Is there a better way to generate a single output file?

  2. Group By

    I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers for the grouping, and when I SUM or COUNT the data I get incorrect results. For example:

    Instead of seeing this:

    part-r-00000:
    grouped_col_val_1, 5, 6
    grouped_col_val_2, 1, 1
    
    part-r-00001:
    grouped_col_val_1, 3, 4
    grouped_col_val_2, 5, 5
    

    I should be seeing:

    part-r-00000:
    grouped_col_val_1, 8, 10
    grouped_col_val_2, 6, 6
    

    So I end up grouping as follows: grouped = GROUP data BY col PARALLEL 1, and then I see the correct result.

    I have a feeling I'm missing something.

    Here is pseudocode for how I am doing the grouping:

    raw = LOAD '$path' USING PigStorage...
    row = FOREACH raw GENERATE id, val;
    grouped = GROUP row BY id;
    report = FOREACH grouped GENERATE group AS id, SUM(row.val);
    STORE report INTO '$outpath' USING PigStorage...
    
Please see my new answer. BTW, I don't want to be nitpicky, but it is "pseudocode", not "sudo-code" (unless you are referring to code executing with superuser permissions!). – cabad
Are you using a custom partitioner? This could be affecting your results; see my answer for details. – cabad
@cabad - I did see your update. I'm not using a custom partitioner; I'm grouping by multiple columns, that's all. – hba
Well, the results you describe are not how Pig works. So either you are hitting a bug, or you've missed something in the pseudocode you included. For example, you are not grouping by multiple columns in your pseudocode; this wouldn't matter, but other things you are missing may. – cabad

1 Answer

1
votes

EDIT: new answers based on the extra details you provided:

1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing an hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
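In case it helps, here is a minimal sketch of what reading the data back through HBaseStorage with -loadKey could look like; the table name, column family, and column names are placeholders rather than anything from your question:

    -- Hypothetical table and column names; '-loadKey true' returns the HBase row key
    -- as the first field of each tuple.
    data = LOAD 'hbase://report_table'
           USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
               'cf:col2 cf:col3', '-loadKey true')
           AS (rowkey:chararray, col2:chararray, col3:chararray);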

2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find for your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
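Since the comments mention grouping by multiple columns, here is a minimal sketch of how that can look without any PARALLEL clause; the schema, delimiter, and paths are assumptions, not taken from your question:

    -- Hypothetical schema; every record sharing the same (col1, col2) key goes to one reducer,
    -- so each group is aggregated exactly once.
    raw     = LOAD '/path/to/input' USING PigStorage(',')
              AS (col1:chararray, col2:chararray, val:long);
    grouped = GROUP raw BY (col1, col2);
    report  = FOREACH grouped GENERATE FLATTEN(group) AS (col1, col2), SUM(raw.val) AS total;
    STORE report INTO '/path/to/output' USING PigStorage(',');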

OLD ANSWERS:

1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:

Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.

The way to do this is as follows:

    C = JOIN A BY a1, B BY b1 USING 'merge';
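For context, a minimal end-to-end sketch, assuming both inputs were already stored sorted on the join key (the paths and field names are placeholders):

    -- Both inputs are assumed to be pre-sorted by the join key, as merge join requires.
    A = LOAD '/data/left_sorted' USING PigStorage(',') AS (a1:chararray, x:long);
    B = LOAD '/data/right_sorted' USING PigStorage(',') AS (b1:chararray, y:long);
    -- The join happens map-side, avoiding the extra sort and shuffle phases.
    C = JOIN A BY a1, B BY b1 USING 'merge';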

2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?