0
votes

TL;DR:

  • Say I have a table receipt with columns (ReceiptID, ItemName, Spend).
  • How can I create a view that limits to a maximum of one unique ReceiptID row?

Details of what I tried and it's drawbacks

// Creating the view `view_of_unique_receipts`
WITH ordered_receipts AS (
    SELECT *, 
           ROW_NUMBER() OVER ( PARTITION BY ReceiptName) AS rn
    FROM receipt_*
)
SELECT * EXCEPT (rn) 
FROM ordered_receipts
WHERE rn = 1

  • Doing SELECT * FROM receipt costs e.g. 50 GB
  • Doing SELECT ReceiptName, Spend FROM receipt costs e.g. 10 GB
  • Doing SELECT ReceiptName, Spend FROM view_of_unique_receipts costs 50GB
    • How can I make this only cost 10gb?
    • It costs 50GB because the row_number partition always selects all the columns (even though the final query is not asking for it)
2

2 Answers

1
votes
WITH ordered_receipts AS (
    SELECT *, 
           ROW_NUMBER() OVER ( PARTITION BY ReceiptName) AS rn
    FROM receipt_*
)
SELECT ReceiptName, Spend 
FROM ordered_receipts
WHERE rn = 1
0
votes

Partition & Clustering over table can limit number of records processed and its corresponding analysis cost. Ref.

Partition are done generally some range of values like dates, clustering is done to group data over low cardinality string fields (e.g. Country, City, sates and so on).

if you able to implement Partition/Clustering over your source table, then you could able to achieve control over Analysis cost.