I am a senior developer but new to Pig.

We have a use case to construct a metric in Pig Latin as follows:

(Count of customers who purchased items in the current month AND purchased items in the prior month) / (Count of customers who purchased items in the prior month)

The first step would seem to be to generate the customer counts per month with GROUP and FOREACH ... GENERATE COUNT(Purchases), write them to a file, and then read them back in again.

When I read the data back in, is there a way, within a FOREACH, to compare the current row (which would now be an aggregate count by month) with the previous row?

Possibly the data should be pivoted before it is written out and read back in, so that each column is compared to the 'previous' one going left to right instead of row by row?

Can a CASE statement in Pig look something like this?

case (customerboolean_has_sales_february + customerboolean_has_sales_january)
  2: countsalesfeb + countsalesjan / countsalesjanuary
  1: null
  0: null

In other words: (customers who rode in the month AND rode in the prior month) / (total customers who rode in the prior month)

1 Answer

1)

  • Load the file twice into relation A and relation B.

  • Sort relations A and B by month.

  • Use RANK on each relation to generate row numbers.

  • Remove the top record (the earliest month) from relation B.

  • Use RANK on relation B again to get a new relation, B_New, so that row n of B_New is the month that follows row n of A.

  • Join A and B_New on the row numbers.

  • For each row generate the result you want from the columns.
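The steps above can be sketched roughly as follows. This is only a sketch under stated assumptions: the file name monthly_counts.tsv, its tab-separated (month, customer_count) layout, and every relation and field name other than A, B and B_New are hypothetical, and the month strings are assumed to sort chronologically (for example 2015-01, 2015-02).

```pig
-- Hypothetical input: tab-separated rows of (month, customer_count)
A = LOAD 'monthly_counts.tsv' USING PigStorage('\t')
    AS (month:chararray, customer_count:long);
B = LOAD 'monthly_counts.tsv' USING PigStorage('\t')
    AS (month:chararray, customer_count:long);

-- Sort by month, then RANK to prepend a sequential row number (field $0)
A_sorted = ORDER A BY month ASC;
A_ranked = RANK A_sorted;
A_rows   = FOREACH A_ranked GENERATE $0 AS row_num, month, customer_count;

B_sorted = ORDER B BY month ASC;
B_ranked = RANK B_sorted;
B_rows   = FOREACH B_ranked GENERATE $0 AS row_num, month, customer_count;

-- Drop the earliest month from B, then re-rank so that
-- row n of B_New is the month following row n of A
B_tail      = FILTER B_rows BY row_num > 1L;
B_tail_rank = RANK B_tail BY row_num ASC;
B_New       = FOREACH B_tail_rank GENERATE $0 AS row_num, month, customer_count;

-- Each joined row pairs a month (from B_New) with the month before it (from A)
joined = JOIN A_rows BY row_num, B_New BY row_num;
result = FOREACH joined GENERATE
    B_New::month          AS month,
    B_New::customer_count AS current_count,
    A_rows::customer_count AS prior_count;

DUMP result;
```

Note that this pairs the aggregate counts of adjacent months. The numerator of the metric in the question (customers who purchased in both months) cannot be derived from those two counts alone, so it would have to be computed from per-customer data before aggregating.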

I've answered a similar question here

2)

*Can a CASE statement in Pig have something like this? case (customerboolean_has_sales_february + customerboolean_has_sales_january)*

  • Yes.
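As an illustration, a CASE expression can switch on the sum of the two flags. The sketch below assumes Pig 0.12 or later (where CASE is available) and a hypothetical tab-separated file customer_flags.tsv with one row per customer and 0/1 flags per month; the shortened field names and the retention output name are made up for the example.

```pig
-- Hypothetical input: one row per customer with 0/1 purchase flags per month
flags = LOAD 'customer_flags.tsv' USING PigStorage('\t')
        AS (customer_id:chararray, has_sales_january:int, has_sales_february:int);

-- CASE on the sum of the flags: 2 means the customer bought in both months
tagged = FOREACH flags GENERATE
    customer_id,
    has_sales_january,
    (CASE has_sales_january + has_sales_february
        WHEN 2 THEN 1
        ELSE 0
     END) AS bought_both;

-- Aggregate over all customers:
-- customers in both months / customers in the prior month
all_rows = GROUP tagged ALL;
metric = FOREACH all_rows GENERATE
    (double)SUM(tagged.bought_both) / (double)SUM(tagged.has_sales_january)
        AS retention;

DUMP metric;
```

Working from per-customer flags like this computes the ratio in a single pass, without writing the monthly counts out and reading them back in.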