
I'm new to Apache Pig and wish to implement bottom-up cubing by writing a Pig script. However, this requires me to group in a hierarchical fashion.

For example, if my data is in the form (exchange, symbol, date, dividend), where dividend is a measure and the rest are dimensions, I would like to first group the data by exchange and compute the aggregate dividend, then group by (exchange, symbol), and so on.

One way to do this is to write out every possible grouping in the script: group by exchange, group by symbol, group by (exchange, symbol), etc. However, this appears to be suboptimal. Is there a way to, for example, first group by exchange, and then, within every exchange group, group internally by symbol, so as to generate aggregates for (exchange) and then for (exchange, symbol)? That seems like it would be more efficient.
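For concreteness, the brute-force version I'm describing would look something like this (the input path `dividends` and the field names are just placeholders for my actual data):

```
divs = LOAD 'dividends'
       AS (exchange:chararray, symbol:chararray, date:chararray, dividend:double);

-- one independent GROUP/aggregate pair per grouping set
by_exch      = GROUP divs BY exchange;
exch_tot     = FOREACH by_exch GENERATE group AS exchange,
                                        SUM(divs.dividend) AS total;

by_exch_sym  = GROUP divs BY (exchange, symbol);
exch_sym_tot = FOREACH by_exch_sym GENERATE FLATTEN(group) AS (exchange, symbol),
                                            SUM(divs.dividend) AS total;

STORE exch_tot     INTO 'totals_by_exchange';
STORE exch_sym_tot INTO 'totals_by_exchange_symbol';
```

With n dimensions this means writing 2^n such GROUP/FOREACH pairs, which is what I'd like to avoid.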

Something similar is discussed here, but it didn't answer my question: Can I generate nested bags using nested FOREACH statements in Pig Latin? Thanks!

Could you provide a sample of the output you expect? Do you want one output directory per group permutation, or just one file with totals for each group on each row? - alexeipab

1 Answer


This all depends on your definition of "optimal". Your intuition is correct in the sense that you will do fewer arithmetic operations: first do a granular grouping by (exchange, symbol, date), then group the results of that to get (exchange, symbol), then group those results to get (exchange). However, your map-reduce flow will be distinctly suboptimal: it requires three map-reduce jobs, with each job's output feeding the next one's input.
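A sketch of that chained roll-up, assuming your input lives at a path like `dividends` and you aggregate with SUM (this only works for algebraic aggregates such as SUM and COUNT, where partial results can be re-aggregated; it would not be correct for AVG without also carrying a count):

```
divs = LOAD 'dividends'
       AS (exchange:chararray, symbol:chararray, date:chararray, dividend:double);

-- job 1: finest grouping
g1   = GROUP divs BY (exchange, symbol, date);
lvl1 = FOREACH g1 GENERATE FLATTEN(group) AS (exchange, symbol, date),
                           SUM(divs.dividend) AS total;

-- job 2: roll the partial sums up to (exchange, symbol)
g2   = GROUP lvl1 BY (exchange, symbol);
lvl2 = FOREACH g2 GENERATE FLATTEN(group) AS (exchange, symbol),
                           SUM(lvl1.total) AS total;

-- job 3: roll up again to exchange
g3   = GROUP lvl2 BY exchange;
lvl3 = FOREACH g3 GENERATE group AS exchange, SUM(lvl2.total) AS total;
```

Each STORE or downstream use of lvl1, lvl2, and lvl3 depends on the previous grouping, which is what forces the three sequential jobs.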

If you do each grouping independently, you will need only one map-reduce job. The mapper emits key-value pairs for each grouping, and the reducer aggregates each kind separately. One map-reduce job means fewer bytes read from and written to disk, and less time spent setting up and tearing down Hadoop jobs. Unless you are doing something very computationally intensive (and computing a sum or average is definitely not), these factors, especially the disk I/O, are the most significant considerations in how long a job takes.
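Also worth noting: if you are on Pig 0.11 or later, the built-in CUBE operator implements exactly this pattern for you, computing all the grouping combinations over a single pass of the data (rolled-up dimensions come back as NULL in the output). A minimal sketch, again assuming a placeholder input path `dividends`:

```
divs  = LOAD 'dividends'
        AS (exchange:chararray, symbol:chararray, date:chararray, dividend:double);

-- produces aggregates for (exchange, symbol), (exchange), (symbol), and ()
cubed  = CUBE divs BY CUBE(exchange, symbol);
totals = FOREACH cubed GENERATE FLATTEN(group) AS (exchange, symbol),
                                SUM(cube.dividend) AS total;
```

If you only want the hierarchical subset (exchange, then exchange/symbol) rather than all combinations, `CUBE divs BY ROLLUP(exchange, symbol)` restricts the output to those grouping sets.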