I have a dataset that I have aggregated at monthly level. The next part needs me to take, for every block of 3 months, the sum of the data at monthly level.
So essentially my input data (after aggregated to monthly level) looks like:
month | year | status | count_id |
---|---|---|---|
08 | 2021 | stat_1 | 1 |
09 | 2021 | stat_1 | 3 |
10 | 2021 | stat_1 | 5 |
11 | 2021 | stat_1 | 10 |
12 | 2021 | stat_1 | 10 |
01 | 2022 | stat_1 | 5 |
02 | 2022 | stat_1 | 20 |
and then my output data to look like:
month | year | status | count_id | 3m_sum |
---|---|---|---|---|
08 | 2021 | stat_1 | 1 | 1 |
09 | 2021 | stat_1 | 3 | 4 |
10 | 2021 | stat_1 | 5 | 8 |
11 | 2021 | stat_1 | 10 | 18 |
12 | 2021 | stat_1 | 10 | 25 |
01 | 2022 | stat_1 | 5 | 25 |
02 | 2022 | stat_1 | 20 | 35 |
i.e 3m_sum for Feb = Feb + Jan + Dec. I tried to do this using a self join and wrote a query along the lines of
WITH CTE AS(
SELECT date_part('month',date_col) as month
,date_part('year',date_col) as year
,status
,count(distinct id) as count_id
FROM (date_col, status, transaction_id) as a
)
SELECT a.month, a.year, a.status, sum(b.count_id) as 3m_sum
from cte as a
left join cte as b on a.status = b.status
and b.month >= a.month - 2 and b.month <= a.month
group by 1,2,3
This query NEARLY works. Where it falls apart is in Jan and Feb. My data is from August 2021 to Apr 2022. The means, the value for Jan should be Nov + Dec + Jan. Similarly for Feb it should be Dec + Jan + Feb.
As I am doing a join on the MONTH, all the months of Aug - Nov are treated as being values > month of jan/feb and so the query isn't doing the correct sum.
How can I adjust this bit to give the correct sum?
I did think of using a LAG function, but (even though I'm 99% sure a month won't ever be missed), I can't guarantee we will never have a month with 0 values, and therefore my LAG function will be summing the wrong rows.
I also tried doing the same join, but at individual date level (and not aggregating in my nested query) but this gave vastly different numbers, as I want the sum of the aggregation and I think the sum from the individual row was duplicated a lot of stuff I do a COUNT DISTINCT on to remove.
group by 1,2,3
are you sure you are using SQL Server? Sample input data would help also - Charliefaceleft join
your data to it, then use eitherlag
or a windowedsum
- Charlieface