2
votes

I have a requirement where-in i need to do data summarization on the date range provided as input. To be more specific: If my data looks like:

Input:
Id|amount|date
1 |10    |2016-01-01
2 |20    |2016-01-02
3 |20    |2016-01-03
4 |20    |2016-09-25
5 |20    |2016-09-26
6 |20    |2016-09-28

And If I want the summarization for the month of September, then I need to calculate count of records on 4 ranges which are:

  1. Current Date, which is each day in September.
  2. Week Start Date(Sunday of the week as per the current date) to Current Date, Ex. if Current Date is 2016-09-28 then week start date would be 2016-09-25 and record counts between 2016-09-25 to 2016-09-28.
  3. Month Start Date to Current Date, which is from 2016-09-01 to Current Date.
  4. Year Start Date to Current Date,which is record count from 2016-01-01 to Current Date.

So My output should have one record with 4 Columns for each day of the month(in this case, Month is September), Something like

Output:

 Current_Date|Current_date_count|Week_To_Date_Count|Month_to_date_Count|Year_to_date_count

 2016-09-25  |1                 |1                 |1                  |4
 2016-09-26  |1                 |2                 |3                  |5
 2016-09-28  |1                 |3                 |3                  |6

Important: i can pass only 2 variables, which is range start date and range end date. Rest calculation need to be dynamic.

Thanks in advance

1

1 Answers

1
votes

You can join on year, then test each condition separately (using sum(if())):

select  a.date, sum(if(a.date=b.date,1,0)), 
                sum(if(month(a.date)=month(b.date) and weekofyear(a.date)=weekofyear(b.date),1,0)),
                sum(if(month(a.date)=month(b.date),1,0)),
                count(*) from
(select * from input_table where date >= ${hiveconf:start} and date <${hiveconf:end}) a, 
(select * from input_table where date <${hiveconf:end}) b 
where year(a.date)=year(b.date) and b.date <= a.date group by a.date;