1
votes

Here is my Hive query, straight from the TPC-DS toolkit:

WITH customer_total_return 
     AS (SELECT sr_customer_sk AS ctr_customer_sk, 
                sr_store_sk    AS ctr_store_sk, 
                Sum(sr_fee)    AS ctr_total_return 
         FROM   store_returns, 
                date_dim 
         WHERE  sr_returned_date_sk = d_date_sk 
                AND d_year = 2000 
         GROUP  BY sr_customer_sk, 
                   sr_store_sk) 
SELECT TOP 100 c_customer_id 
FROM   customer_total_return ctr1, 
       store, 
       customer 
WHERE  ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2 
                                FROM   customer_total_return ctr2 
                                WHERE  ctr1.ctr_store_sk = ctr2.ctr_store_sk) 
       AND s_store_sk = ctr1.ctr_store_sk 
       AND s_state = 'TN' 
       AND ctr1.ctr_customer_sk = c_customer_sk 
ORDER  BY c_customer_id; 

However, I get the following error when attempting to run it:

FAILED: ParseException line 11:11 cannot recognize input near 'TOP' '100' 'c_customer_id' in selection target

My understanding is that TOP 100 is not syntactically valid in HiveQL. How can I rewrite this properly?

2
Use LIMIT. And proper JOIN syntax. - Gordon Linoff

2 Answers

4
votes

Use LIMIT instead of TOP. Unlike TOP, which follows SELECT, LIMIT goes at the very end of the query, after ORDER BY:

WITH customer_total_return 
     AS (SELECT sr_customer_sk AS ctr_customer_sk, 
                sr_store_sk    AS ctr_store_sk, 
                Sum(sr_fee)    AS ctr_total_return 
         FROM   store_returns, 
                date_dim 
         WHERE  sr_returned_date_sk = d_date_sk 
                AND d_year = 2000 
         GROUP  BY sr_customer_sk, 
                   sr_store_sk) 
SELECT c_customer_id 
FROM   customer_total_return ctr1, 
       store, 
       customer 
WHERE  ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2 
                                FROM   customer_total_return ctr2 
                                WHERE  ctr1.ctr_store_sk = ctr2.ctr_store_sk) 
       AND s_store_sk = ctr1.ctr_store_sk 
       AND s_state = 'TN' 
       AND ctr1.ctr_customer_sk = c_customer_sk 
ORDER  BY c_customer_id
LIMIT 100;

1
votes

This is a bad example of a query on many levels. I would suggest:

WITH customer_total_return AS (
      SELECT sr.sr_customer_sk AS ctr_customer_sk, 
             sr.sr_store_sk  AS ctr_store_sk, 
             SUM(sr.sr_fee) AS ctr_total_return,
             AVG(SUM(sr.sr_fee)) OVER (PARTITION BY sr.sr_store_sk) as avg_store_sr_fee
      FROM store_returns sr JOIN
           date_dim d
           ON sr.sr_returned_date_sk = d.d_date_sk 
      WHERE d.d_year = 2000
      GROUP BY sr.sr_customer_sk, sr.sr_store_sk
     ) 
SELECT c.c_customer_id 
FROM customer_total_return ctr JOIN
     store s
     ON s.s_store_sk = ctr.ctr_store_sk JOIN
     customer c
     ON ctr.ctr_customer_sk = c.c_customer_sk
WHERE ctr.ctr_total_return > 1.2 * ctr.avg_store_sr_fee AND
      s.s_state = 'TN'  
ORDER  BY c.c_customer_id
LIMIT 100;

Notes:

  • Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
  • Qualify all column references, especially when a query has more than one table reference.
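  • The correlated subquery to calculate the per-store average is not needed; a window function (AVG(...) OVER (PARTITION BY ...)) computes it in the same pass as the totals.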
  • Hive uses LIMIT, not TOP.
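
If you want to convince yourself that the window-function form returns the same rows as the correlated subquery, here is a minimal sketch that compares the two on toy data. It uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so it runs anywhere; window functions need SQLite 3.25 or later). The table toy_returns, its columns, and the sample rows are all hypothetical, and the window version is split into two CTEs rather than nesting AVG(SUM(...)) as in the query above.

```python
import sqlite3

# Hypothetical toy data standing in for store_returns: (customer, store, fee).
rows = [
    ("A", 1, 4.0), ("A", 1, 6.0),    # customer A, store 1: total 10
    ("B", 1, 20.0),                  # customer B, store 1: total 20
    ("C", 1, 60.0),                  # customer C, store 1: total 60 (store average is 30)
    ("D", 2, 50.0), ("E", 2, 50.0),  # store 2: both totals 50 (store average is 50)
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_returns (customer TEXT, store INTEGER, fee REAL)")
conn.executemany("INSERT INTO toy_returns VALUES (?, ?, ?)", rows)

# Shared first step: total fee per (customer, store), like customer_total_return.
TOTALS = """
    WITH totals AS (
        SELECT customer, store, SUM(fee) AS total
        FROM toy_returns
        GROUP BY customer, store
    )
"""

# Original shape: correlated subquery recomputes the store average per row.
correlated_sql = TOTALS + """
    SELECT t1.customer
    FROM totals t1
    WHERE t1.total > (SELECT AVG(t2.total) * 1.2
                      FROM totals t2
                      WHERE t2.store = t1.store)
    ORDER BY t1.customer
    LIMIT 100
"""

# Rewritten shape: the store average comes from a window function instead.
windowed_sql = TOTALS + """
    , scored AS (
        SELECT customer, total,
               AVG(total) OVER (PARTITION BY store) AS store_avg
        FROM totals
    )
    SELECT customer
    FROM scored
    WHERE total > 1.2 * store_avg
    ORDER BY customer
    LIMIT 100
"""

correlated = [r[0] for r in conn.execute(correlated_sql)]
windowed = [r[0] for r in conn.execute(windowed_sql)]
print(correlated, windowed)
```

On this data both lists contain only customer C, whose total of 60 exceeds 1.2 times the store 1 average of 30. This only checks the logic; it says nothing about how Hive will plan or execute either form.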