0
votes

I have a dataset that I have aggregated at monthly level. The next part needs me to take, for every block of 3 months, the sum of the data at monthly level.

So essentially my input data (after aggregated to monthly level) looks like:

month year status count_id
08 2021 stat_1 1
09 2021 stat_1 3
10 2021 stat_1 5
11 2021 stat_1 10
12 2021 stat_1 10
01 2022 stat_1 5
02 2022 stat_1 20

and then my output data to look like:

month year status count_id 3m_sum
08 2021 stat_1 1 1
09 2021 stat_1 3 4
10 2021 stat_1 5 8
11 2021 stat_1 10 18
12 2021 stat_1 10 25
01 2022 stat_1 5 25
02 2022 stat_1 20 35

i.e 3m_sum for Feb = Feb + Jan + Dec. I tried to do this using a self join and wrote a query along the lines of

            WITH CTE AS(
            
            SELECT date_part('month',date_col) as month
                  ,date_part('year',date_col) as year
                  ,status
                  ,count(distinct id) as count_id
            
            FROM (date_col, status, transaction_id) as a

            )
            
            SELECT a.month, a.year, a.status, sum(b.count_id) as 3m_sum
            
            from cte as a
            
            left join cte as b on a.status = b.status
            and b.month >= a.month - 2 and b.month <= a.month
            
            group by 1,2,3

This query NEARLY works. Where it falls apart is in Jan and Feb. My data is from August 2021 to Apr 2022. The means, the value for Jan should be Nov + Dec + Jan. Similarly for Feb it should be Dec + Jan + Feb.

As I am doing a join on the MONTH, all the months of Aug - Nov are treated as being values > month of jan/feb and so the query isn't doing the correct sum.

How can I adjust this bit to give the correct sum?

I did think of using a LAG function, but (even though I'm 99% sure a month won't ever be missed), I can't guarantee we will never have a month with 0 values, and therefore my LAG function will be summing the wrong rows.

I also tried doing the same join, but at individual date level (and not aggregating in my nested query) but this gave vastly different numbers, as I want the sum of the aggregation and I think the sum from the individual row was duplicated a lot of stuff I do a COUNT DISTINCT on to remove.

2
group by 1,2,3 are you sure you are using SQL Server? Sample input data would help also - Charlieface
Also, it might help if the expected output better reflected the data actually posted. For example, the 1st data row (month=11 year=2021) has a count_id=10 - yet the 3m_sum = 12? If November 2021 really were the first month of data, the 3m_sums should actually be 10, correct? - SOS
@SOS yes, but actually I start in Aug 2021, and so I just skipped the first few months and started with Nov - Kaish
@Kaish - Yes, but we can't see your data - only what you post here ;-) To minimize confusion it's best that examples be self-contained, and any aggregate totals use the only the values displayed here. - SOS
Start with a calendar table of all months, left join your data to it, then use either lag or a windowed sum - Charlieface

2 Answers

1
votes

You can use a SUM with a window frame of 2 PRECEDING. To ensure you don't miss rows, use a calendar table and left-join all the results to it.

SELECT *,
  SUM(a.count_id) OVER (ORDER BY c.year, c.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Calendar c
LEFT JOIN a ON a.year = c.year AND a.month = c.month
WHERE c.year >= 2021 AND c.year <= 2022;

db<>fiddle

You could also use LAG but you would need it twice.

0
votes

It should be @Charlieface's answer - only that I get one different result than you put in your expected result table:

WITH
-- your input - and I avoid keywords like "MONTH" or "YEAR"
-- and also identifiers starting with digits are forbidden -                                                                                                                                                         
indata(mm,yy,status,count_id,sum_3m) AS (
          SELECT 08,2021,'stat_1',1,1
UNION ALL SELECT 09,2021,'stat_1',3,4
UNION ALL SELECT 10,2021,'stat_1',5,8
UNION ALL SELECT 11,2021,'stat_1',10,18
UNION ALL SELECT 12,2021,'stat_1',10,25
UNION ALL SELECT 01,2022,'stat_1',5,25
UNION ALL SELECT 02,2022,'stat_1',20,35
)
SELECT
  *
, SUM(count_id) OVER(
    ORDER BY yy,mm 
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS sum_3m_calc
FROM indata;
-- out  mm |  yy  | status | count_id | sum_3m | sum_3m_calc 
-- out ----+------+--------+----------+--------+-------------
-- out   8 | 2021 | stat_1 |        1 |      1 |           1
-- out   9 | 2021 | stat_1 |        3 |      4 |           4
-- out  10 | 2021 | stat_1 |        5 |      8 |           9
-- out  11 | 2021 | stat_1 |       10 |     18 |          18
-- out  12 | 2021 | stat_1 |       10 |     25 |          25
-- out   1 | 2022 | stat_1 |        5 |     25 |          25
-- out   2 | 2022 | stat_1 |       20 |     35 |          35