1
votes

I have a dataset with booking hotels. date_in has format "yyyy-MM-dd". I need select top 10 the most visited hotel by month.

SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits
FROM (
   SELECT  date_ci, hotel_id, COUNT(id) AS count_visits, 
   RANK() OVER (
   PARTITION BY date_ci, hotel_id ORDER BY COUNT(id) DESC) as rank
   FROM ( 
      SELECT id, hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci 
      FROM my_database.my_tab) x
   ) top_visits
GROUP BY date_ci, hotel_id HAVING rank <= 10;

I get the following error:

Error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:13 Expression not in GROUP BY key 'hotel_id'

2
PARTITION BY date_ci, hotel_id ORDER BY COUNT(id) DESC) as rank -> PARTITION BY date_ci ORDER BY COUNT(id) DESC) as rankGirlInBigData

2 Answers

0
votes

I would suggest something like this:

SELECT ymh.*
FROM (SELECT YEAR(date_in) as yyyy, MONTH(date_in) as mm, hotel_id, 
             COUNT(*) AS count_visits, 
             ROW_NUMBER() OVER (PARTITION BY YEAR(date_in), MONTH(date_in), COUNT(*) DESC) as seqnum
      FROM my_database.my_tab
      GROUP BY YEAR(date_in), MONTH(date_in)
     ) ymh
WHERE seqnum <= 10;

That is, there is one aggregation and one window function call to enumerate the values.

0
votes

Move COUNT(id) aggregation into the subquery, add group by:

SELECT top_visits.date_ci, top_visits.hotel_id, top_visits.count_visits
FROM (
      SELECT date_ci, hotel_id, count_visits, 
             RANK() OVER (PARTITION BY date_ci, hotel_id ORDER BY count_visits DESC) as rank
        FROM ( 
              SELECT hotel_id, SUBSTRING(my_tab.date_in, 1, 7) as date_ci,
                     COUNT(id) AS count_visits 
                FROM my_database.my_tab
              GROUP BY hotel_id, SUBSTRING(my_tab.date_in, 1, 7)
             ) x
     ) top_visits
WHERE rank <= 10;