0
votes

I have a common table expression with a window function and keep getting an error message:

Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 82:6 Invalid column reference 'gcr_amt' in definition of CTE pro_orders [ select o.shopper_id as pro_shopper_id, date_format(o.order_date, 'YYYYMM') as ym_order, sum(o.gcr_amt) as total_gcr, sum(case when o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt end) as new_gcr, sum(o.gcr_amt) over (partition by o.shopper_id rows between 12 preceding and 1 following) as 12months_direct_gcr from dp_enterprise.uds_order o inner join combined_shopper_level_data cs on cs.pro_shopper_id = o.shopper_id and cs.year_month = date_format(o.order_date, 'YYYYMM') where o.exclude_reason_desc is Null group by o.shopper_id, o.order_date ] used as po at Line 83:5

My cte looks like this:

pro_orders as (
  select  o.shopper_id as pro_shopper_id,
          date_format(o.order_date, 'YYYYMM') as ym_order,
          sum(o.gcr_amt) as total_gcr,
          sum(case when o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt end) as new_gcr,
          sum(o.gcr_amt) over (partition by o.shopper_id, cs.year_month order by cs.year_month desc rows between 12 preceding and 0 following) as 12months_direct_gcr
  from dp_enterprise.uds_order o
  right join combined_shopper_level_data cs on cs.pro_shopper_id = o.shopper_id and cs.year_month = date_format(o.order_date, 'YYYYMM')
  group by o.shopper_id, o.order_date
),

I don't use window functions that often and perhaps my syntax is off. In english what I'm trying to do is get the 12 months total for metric "gcr".

So a row with shopper_id 123abc in year_month 201901, I'd like to sum the preceding 11 months plus the current rows month of gcr for a total of 12 months. Not sure if my window function is correctly set up for that or not?

The format of year_month being referenced is YYYYMM e.g. 201901.

Is my window function set up right given my goal?

How can I overcome this error message?

Edit: Still receiving this error message with the following CTE:

pro_orders as (
  select  o.shopper_id as pro_shopper_id,
          cs.year_month,
          sum(case when date_format(o.order_date, 'YYYYMM') = cs.year_month then o.gcr_amt else 0 end) as total_gcr,
          sum(case when date_format(o.order_date, 'YYYYMM') = cs.year_month and o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt else 0 end) as new_gcr,
          sum(sum(o.gcr_amt)) over  (partition by o.shopper_id 
                                order by cs.year_month desc 
                                rows between 12 preceding and 0 following) 
                                as 12months_direct_gcr
  from combined_shopper_level_data cs
  left join dp_enterprise.uds_order o on o.shopper_id = cs.pro_shopper_id
  where o.exclude_reason_desc is Null
  group by o.shopper_id, cs.year_month
),

Results in similar error message:

Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 83:10 Invalid column reference 'gcr_amt' in definition of CTE pro_orders [ select o.shopper_id as pro_shopper_id, cs.year_month, sum(case when date_format(o.order_date, 'YYYYMM') = cs.year_month then o.gcr_amt else 0 end) as total_gcr, sum(case when date_format(o.order_date, 'YYYYMM') = cs.year_month and o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt else 0 end) as new_gcr, sum(sum(o.gcr_amt)) over (partition by o.shopper_id order by cs.year_month desc rows between 12 preceding and 0 following) as 12months_direct_gcr from combined_shopper_level_data cs left join dp_enterprise.uds_order o on o.shopper_id = cs.pro_shopper_id where o.exclude_reason_desc is Null group by o.shopper_id, cs.year_month ] used as po at Line 87:5

1

1 Answers

1
votes

You have an aggregation query, so the window function looks a little funny. The basic idea is something like this:

sum(sum(o.gcr_amt)) over (partition by o.shopper_id, cs.year_month
                          order by cs.year_month desc
                          rows between 12 preceding and 0 following
                         ) as 12months_direct_gcr

This still will not work. First, you have the value in the order by and partition by. Second, it is not in the group by.

Assuming there is a value for each month, then you can use:

sum(sum(o.gcr_amt)) over (partition by o.shopper_id
                          order by cs.year_month desc
                          rows between 12 preceding and 0 following
                         ) as 12months_direct_gcr

And use cs.year_month in the group by (which may require adjusting other parts of the query.

For readability, I also recommend that you use left join rather than right join. For me (and most people), it is much simpler cognitively to say "keep all the rows in the first table I just read" rather than "keep all the rows in some table I'm going to read at the end of the from clause".

EDIT:

I think the full query is:

with pro_orders as (
      select o.shopper_id as pro_shopper_id,
             cs.year_month,
             sum(coalesce(o.gcr_amt, 0)) as total_gcr,
             sum(case when o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt else 0 end) as new_gcr,
             sum(sum(o.gcr_amt)) over (partition by o.shopper_id 
                                       order by cs.year_month desc 
                                       rows between 12 preceding and 0 following
                                      ) as 12months_direct_gcr
      from combined_shopper_level_data cs left join
           dp_enterprise.uds_order o
           on o.shopper_id = cs.pro_shopper_id and
              date_format(o.order_date, 'YYYYMM') = cs.year_month and
              o.exclude_reason_desc is Null
      group by o.shopper_id, cs.year_month
     ),

It is possible that there is a limitation in Hive on using window functions in aggregation queries (this would sort of surprise me, because these are processed separately). I can't find a specific reference to this. If so, just use a subquery:

with pro_orders as (
      select pro_shopper_id, year_month, total_gcr, new_gcr
             sum(sum(total_gcr_amt)) over (partition by pro_shopper_id
                                           order by year_month desc 
                                           rows between 12 preceding and 0 following
                                          ) as 12months_direct_gcr
      from (select o.shopper_id as pro_shopper_id,
                   cs.year_month,
                   sum(coalesce(o.gcr_amt, 0)) as total_gcr,
                   sum(case when o.product_pnl_new_renewal_name = 'New Purchase' then o.gcr_amt else 0 end) as new_gcr,
          from combined_shopper_level_data cs left join
               dp_enterprise.uds_order o
               on o.shopper_id = cs.pro_shopper_id and
                  date_format(o.order_date, 'YYYYMM') = cs.year_month and
                  o.exclude_reason_desc is Null
          group by o.shopper_id, cs.year_month
         ) ps
     ),