0
votes

I have a partitioned table in hive. The schema and a sample is shown below

item_id | price | brand | partition_id
AX_12      340.22  Apple.    356
AZ_47      230.00  Samsung   357
AX_12      321.00. Apple.    357
AQ_17.     125.00  Lenovo.   356

If an item is present in multiple partitions. I need to select the row with latest partition So the expected output for this example is this

item_id | price | brand | partition_id
AX_12      321.00  Apple.    357
AZ_47      230.00  Samsung   357
AQ_17.     125.00  Lenovo.   356

There are 10 partitions in the table and each partition has 10 million rows

1

1 Answers

0
votes

You can use window functions to filter on the top record per group:

select t.*
from (
    select t.*, row_number() over(partition by item_id order by partition_id desc) rn 
    from mytable t
)
where rn = 1

A typical alternative is to filter with a correlated subquery:

select t.*
from mytable t
where t.partition_id = (
    select max(t1.partition_id) from mytbale t1 where t1.item_id = t.item_id
)