0
votes

I have a transaction table t1 in Hive that looks like this:

store_id   cust_id    zip_code   transaction_count   spend  
    1000       100      123      3                    50
    2000       200      456      2                    20

I'm trying to compute some metrics using this table as follows:

SELECT
    store_id,
    zip_code,
    SUM(transaction_count) OVER (PARTITION BY store_id, zip_code) / COUNT(*) OVER(PARTITION BY store_id, zip_code) AS avg_transactions,
    SUM(spend) OVER(PARTITION BY store_id, zip_code)/SUM(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_purchase
FROM 
    t1
GROUP BY 
    1, 2

Hive is throwing an error

SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Expression not in GROUP BY key 'spend'

But the column spend is being used with an aggregation function (sum), so I cannot group by this column. How can I fix this error?

2

2 Answers

3
votes

Window functions do not need group by. You either want:

SELECT
    store_id,
    zip_code,
    AVG(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_transactions,
    SUM(spend) OVER(PARTITION BY store_id, zip_code)
        / SUM(transaction_count) OVER(PARTITION BY store_id, zip_code) AS avg_purchase
FROM t1

Note that the first expression can be shortened with AVG().

This brings as many rows as in the original table. If, on the other hand, you want just one row per store and zip code, then use regular aggregate functions with GROUP BY:

SELECT
    store_id,
    zip_code,
    AVG(transaction_count) AS avg_transactions,
    SUM(spend) / SUM(transaction_count) AS avg_purchase
FROM t1
GROUP BY store_id, zip_code
1
votes

Hmmm . . . I think you want:

SELECT store_id, zip_code, AVG(transaction_count) as avg_transactions, 
       SUM(spend) / SUM(transaction_count) as avg_purchase
FROM t1
GROUP BY 1, 2;

I don't see why you are using window functions.