
I have a dataset stored in Hive, structured like the one below; call it df:

+-----+-----+----------+--------+
| id1 | id2 |   date   | amount |
+-----+-----+----------+--------+
|   1 |   2 | 11-07-17 | 0.93   |
|   2 |   2 | 11-11-17 | 1.94   |
|   2 |   2 | 11-09-17 | 1.90   |
|   1 |   1 | 11-10-17 | 0.33   |
|   2 |   2 | 11-10-17 | 1.93   |
|   1 |   1 | 11-07-17 | 0.25   |
|   1 |   1 | 11-09-17 | 0.33   | 
|   1 |   1 | 11-12-17 | 0.33   |
|   2 |   2 | 11-08-17 | 1.90   |
|   1 |   1 | 11-08-17 | 0.30   |
|   2 |   2 | 11-12-17 | 2.01   |
|   1 |   2 | 11-12-17 | 1.00   |
|   1 |   2 | 11-09-17 | 0.94   |
|   2 |   2 | 11-07-17 | 1.94   |
|   1 |   2 | 11-11-17 | 1.92   |
|   1 |   1 | 11-11-17 | 0.33   |
|   1 |   2 | 11-10-17 | 1.92   |
|   1 |   2 | 11-08-17 | 0.94   |
+-----+-----+----------+--------+

I want to partition by id1 and id2, order by date descending within each partition, and then rank "amount" so that a run of identical amounts on consecutive days shares one rank. The ordered and ranked output I hope to see is:

+-----+-----+------------+--------+------+
| id1 | id2 |    date    | amount | rank |
+-----+-----+------------+--------+------+
|   1 |   1 | 2017-11-12 | 0.33   |    1 |
|   1 |   1 | 2017-11-11 | 0.33   |    1 |
|   1 |   1 | 2017-11-10 | 0.33   |    1 |
|   1 |   1 | 2017-11-09 | 0.33   |    1 |
|   1 |   1 | 2017-11-08 | 0.30   |    2 |
|   1 |   1 | 2017-11-07 | 0.25   |    3 |
|   1 |   2 | 2017-11-12 | 1.00   |    1 |
|   1 |   2 | 2017-11-11 | 1.92   |    2 |
|   1 |   2 | 2017-11-10 | 1.92   |    2 |
|   1 |   2 | 2017-11-09 | 0.94   |    3 |
|   1 |   2 | 2017-11-08 | 0.94   |    3 |
|   1 |   2 | 2017-11-07 | 0.93   |    4 |
|   2 |   2 | 2017-11-12 | 2.01   |    1 |
|   2 |   2 | 2017-11-11 | 1.94   |    2 |
|   2 |   2 | 2017-11-10 | 1.93   |    3 |
|   2 |   2 | 2017-11-09 | 1.90   |    4 |
|   2 |   2 | 2017-11-08 | 1.90   |    4 |
|   2 |   2 | 2017-11-07 | 1.94   |    5 |
+-----+-----+------------+--------+------+

I attempted this with the following SQL query:

SELECT 
    id1, 
    id2, 
    date, 
    amount,
    dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
    df
GROUP BY
    id1,
    id2,
    date,
    amount

But that query doesn't return the output I'm looking for.

It seems like a window function using dense_rank, PARTITION BY, and ORDER BY is what I need, but I can't get it to produce the sample output above. Any help would be much appreciated! Thanks!

The complication is "on consecutive days"; so far your code does not attempt to accommodate that. - Paul Maxwell
Wouldn't dense_rank() give the same rank to consecutive combinations of id1, id2, and amount? That was what I was going for anyway. I'm relatively new to window functions, so maybe my interpretation is incorrect. - dave
You have ordered dense_rank by date, not amount, so it is sequencing by change of date. What you want is for similar amounts to get the same rank; the final result rows can then be ordered by date. - Paul Maxwell
I am not sure this would work for my use case. Doing it this way, wouldn't rows 14 and 18 receive the same rank, when I need them to be ranked 2 and 5 respectively? - dave
dense_rank() is an analytical function. - Tutu Kumari

1 Answer


This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:

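select df.*,
       -- running count of value changes: each new run of equal amounts bumps the rank
       sum(case when prev_amount = amount then 0 else 1 end) over
           (partition by id1, id2 order by date desc) as rank
from (select df.*,
             -- amount from the previous row in date-descending order (NULL on the first row)
             lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
      from df
     ) df;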
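
To see why the lag() plus cumulative sum approach gives the ranks in the question, here is a rough pure-Python sketch of the same logic run against the sample rows. It is only an illustration, not Hive code: the function name `rank_runs`, the tuple layout, and parsing the dates as MM-DD-YY are my assumptions.

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the question: (id1, id2, date as MM-DD-YY, amount).
rows = [
    (1, 2, "11-07-17", 0.93), (2, 2, "11-11-17", 1.94), (2, 2, "11-09-17", 1.90),
    (1, 1, "11-10-17", 0.33), (2, 2, "11-10-17", 1.93), (1, 1, "11-07-17", 0.25),
    (1, 1, "11-09-17", 0.33), (1, 1, "11-12-17", 0.33), (2, 2, "11-08-17", 1.90),
    (1, 1, "11-08-17", 0.30), (2, 2, "11-12-17", 2.01), (1, 2, "11-12-17", 1.00),
    (1, 2, "11-09-17", 0.94), (2, 2, "11-07-17", 1.94), (1, 2, "11-11-17", 1.92),
    (1, 1, "11-11-17", 0.33), (1, 2, "11-10-17", 1.92), (1, 2, "11-08-17", 0.94),
]


def rank_runs(rows):
    """Hypothetical helper mirroring lag() + running sum per (id1, id2)."""
    # Order by date descending first, then stable-sort by the partition keys,
    # so each (id1, id2) group ends up in date-descending order.
    ordered = sorted(
        rows, key=lambda r: datetime.strptime(r[2], "%m-%d-%y"), reverse=True
    )
    ordered.sort(key=lambda r: (r[0], r[1]))

    ranked = []
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        rank = 0
        prev_amount = None  # plays the role of lag() returning NULL on the first row
        for id1, id2, date, amount in group:
            # Equivalent of: case when prev_amount = amount then 0 else 1 end
            if amount != prev_amount:
                rank += 1
            ranked.append((id1, id2, date, amount, rank))
            prev_amount = amount
    return ranked


ranked = rank_runs(rows)
for row in ranked:
    print(row)
```

The point is that the rank only increases when the amount differs from the previous row, so the two 1.94 rows for id1 = 2, id2 = 2 land in different ranks because other amounts sit between them, which a dense_rank() ordered by amount would not do.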