1 vote

I am trying to calculate a column that keeps a running count of consecutive NULL values, where the count resets on each non-NULL value.

I am currently trying to achieve this on this version of redshift:

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.8187

I tried using this window function, but it just keeps incrementing the number for every null:

ROW_NUMBER() OVER (PARTITION BY ID, VAL ORDER BY VAL ROWS UNBOUNDED PRECEDING)

For example if I had a dataset like this:

id  | date  | val
----+-------+-------
  1 |   1/1 | NULL
  1 |   1/2 | NULL 
  1 |   1/3 | NULL 
  1 |   1/4 |  1
  1 |   1/5 | NULL 
  1 |   1/6 | NULL 
  1 |   1/7 |  1 
  2 |   1/8 |  2
  2 |   1/9 | NULL
  2 |   1/1 | NULL
  2 |   1/2 |  1
  2 |   1/3 | NULL
  2 |   1/4 |  0
  2 |   1/5 | NULL
  2 |   1/6 | NULL  

I would like the output to look like this:

id  | date  | val   | foo
----+-------+-------+-------
  1 |   1/1 | NULL  |  1
  1 |   1/2 | NULL  |  2
  1 |   1/3 | NULL  |  3
  1 |   1/4 |  1    |
  1 |   1/5 | NULL  |  1 
  1 |   1/6 | NULL  |  2
  1 |   1/7 |  1    |
  2 |   1/8 |  2    |
  2 |   1/9 | NULL  |  1
  2 |   1/1 | NULL  |  2
  2 |   1/2 |  1    |
  2 |   1/3 | NULL  |  1
  2 |   1/4 |  0    |
  2 |   1/5 | NULL  |  1
  2 |   1/6 | NULL  |  2
Comments:

"What is your complete SQL?" – demircioglu

"Is this mock-up data? The data you have provided does not contain enough distinct values for a running sum to generate your output; with the example above you could only get a running sum over the id, val combination." – demircioglu

"This is indeed mock-up data. I do have additional fields in my datasets, but they're all transactional figures and I do not know if they could be used as partitions, e.g. LTV, spend velocity, etc." – user3348557

"I played with it a little bit and came up with running sums for the id, val combo. The sum column goes 1,2,3,0,4,5,0,1,0,2... because IMO it's not possible to create the output from the above input. Used sum(case when val = 'NULL' then 1 else 0 end) over (partition by id, val order by id, val, date rows between unbounded preceding and current row) as foo" – demircioglu

"That is good to know. What additional things would be needed to make the above output possible?" – user3348557

1 Answer

2 votes

Starter

I think that there is a glitch in your sample data, in the records highlighted below:

id  | date  | val   | foo
----+-------+-------+-------
  1 |   1/1 | NULL  |  1
  1 |   1/2 | NULL  |  2
  1 |   1/3 | NULL  |  3
  1 |   1/4 |  1    |
  1 |   1/5 | NULL  |  1 
  1 |   1/6 | NULL  |  2
  1 |   1/7 |  1    |
  2 |   1/8 |  2    |       --> this record is not in sequence
  2 |   1/9 | NULL  |  1    --> neither this one
  2 |   1/1 | NULL  |  2    --> so this record should have foo = 1, not 2
  2 |   1/2 |  1    |
  2 |   1/3 | NULL  |  1
  2 |   1/4 |  0    |
  2 |   1/5 | NULL  |  1
  2 |   1/6 | NULL  |  2

I simply removed these three records from the dataset. If you are not OK with that, do not read further...


Answer

This is a variation of the gaps-and-islands problem. To solve it, the idea is to build groups (islands) of consecutive null records. For this, we compute row_number()s over two different partitions: one over id only, and one over id plus a null/not-null flag on val. The difference between the two row numbers is constant within each island, so it identifies the group.

Then, all that is left to do is assign a new row number to each null-val record within the group it belongs to.
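Before looking at the SQL, the difference-of-row-numbers trick can be sketched outside the database. The following Python snippet is my own illustration (not part of the original answer): it emulates rn1, rn2, and the per-island renumbering on the cleaned-up sample data.

```python
from collections import defaultdict

def consecutive_null_counts(rows):
    """rows: (id, date, val) tuples, already sorted by id then date.
    Returns the foo column: a running count over consecutive NULLs,
    reset by each non-NULL val."""
    rn2 = defaultdict(int)         # row number per (id, is-null flag)
    group_size = defaultdict(int)  # NULL rows seen so far per (id, island)
    foo = []
    for rn1, (id_, _date, val) in enumerate(rows, start=1):
        rn2[(id_, val is None)] += 1
        # constant within an island of consecutive NULLs (or non-NULLs)
        group = rn1 - rn2[(id_, val is None)]
        if val is None:
            group_size[(id_, group)] += 1
            foo.append(group_size[(id_, group)])
        else:
            foo.append(None)
    return foo

# Sample data with the three out-of-sequence records removed
rows = [
    (1, '1/1', None), (1, '1/2', None), (1, '1/3', None),
    (1, '1/4', 1), (1, '1/5', None), (1, '1/6', None), (1, '1/7', 1),
    (2, '1/2', 1), (2, '1/3', None), (2, '1/4', 0),
    (2, '1/5', None), (2, '1/6', None),
]
print(consecutive_null_counts(rows))
# → [1, 2, 3, None, 1, 2, None, None, 1, None, 1, 2]
```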

Query:

select 
    id,
    date,
    val,
    -- number the NULL rows within their island of consecutive NULLs
    case when val is null
        then row_number() over(partition by id, rn1 - rn2 order by date) 
        else null
    end foo
from (
    select
        t.*,
        -- global row number over the whole table
        row_number() 
            over(order by id, date) rn1,
        -- row number within (id, NULL/not-NULL flag)
        row_number() 
            over(partition by id, case when val is null then 1 else 0 end order by date) rn2
    from mytable t
) t
order by id, date   

Demo on DB Fiddle:

| id  | date                     | val | foo |
| --- | ------------------------ | --- | --- |
| 1   | 2019-01-01T00:00:00.000Z |     | 1   |
| 1   | 2019-01-02T00:00:00.000Z |     | 2   |
| 1   | 2019-01-03T00:00:00.000Z |     | 3   |
| 1   | 2019-01-04T00:00:00.000Z | 1   |     |
| 1   | 2019-01-05T00:00:00.000Z |     | 1   |
| 1   | 2019-01-06T00:00:00.000Z |     | 2   |
| 1   | 2019-01-07T00:00:00.000Z | 1   |     |
| 2   | 2019-01-02T00:00:00.000Z | 1   |     |
| 2   | 2019-01-03T00:00:00.000Z |     | 1   |
| 2   | 2019-01-04T00:00:00.000Z | 0   |     |
| 2   | 2019-01-05T00:00:00.000Z |     | 1   |
| 2   | 2019-01-06T00:00:00.000Z |     | 2   |
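For readers without a Redshift cluster at hand, the query can also be sanity-checked locally. The sketch below runs it against an in-memory SQLite database (window functions require SQLite 3.25+, bundled with recent Python builds) via the standard sqlite3 module; the table name mytable and the cleaned-up sample data match the answer above.

```python
import sqlite3

# Sample data from the question, with the three out-of-sequence
# records for id 2 removed, as discussed above.
rows = [
    (1, '2019-01-01', None), (1, '2019-01-02', None),
    (1, '2019-01-03', None), (1, '2019-01-04', 1),
    (1, '2019-01-05', None), (1, '2019-01-06', None),
    (1, '2019-01-07', 1),
    (2, '2019-01-02', 1), (2, '2019-01-03', None),
    (2, '2019-01-04', 0), (2, '2019-01-05', None),
    (2, '2019-01-06', None),
]

con = sqlite3.connect(':memory:')
con.execute('create table mytable (id int, date text, val int)')
con.executemany('insert into mytable values (?, ?, ?)', rows)

result = con.execute("""
    select
        id, date, val,
        case when val is null
            then row_number() over (partition by id, rn1 - rn2 order by date)
        end foo
    from (
        select
            t.*,
            row_number() over (order by id, date) rn1,
            row_number() over (partition by id,
                case when val is null then 1 else 0 end
                order by date) rn2
        from mytable t
    ) t
    order by id, date
""").fetchall()

for id_, date, val, foo in result:
    print(id_, date, val, foo)
# foo column: 1, 2, 3, None, 1, 2, None, None, 1, None, 1, 2
```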