5
votes

I am trying to do an alarm flood calculation in Snowflake. I created the dataset below using a Snowflake window function. If the value is greater than or equal to 3, an alarm flood starts, and it ends at the next 0 value. So in the example below, the first flood started at 9:51 and ended at 9:54, lasting 3 minutes. The next flood started at 9:57 and ended at 10:02, lasting 5 minutes. (The value at 9:59 is 3, but since a flood had already started, it is not counted as a new one.) The next flood starts at 10:03, but there is no 0 value after it, so we have to use the edge value at 10:06. So the total time in flood is 3 + 5 + 4 = 12 minutes.

   DateTime    Value
3/10/2020 9:50  1
3/10/2020 9:51  3
3/10/2020 9:52  1
3/10/2020 9:53  2
3/10/2020 9:54  0
3/10/2020 9:55  0
3/10/2020 9:56  1
3/10/2020 9:57  3
3/10/2020 9:58  2
3/10/2020 9:59  3
3/10/2020 10:00 2
3/10/2020 10:01 2
3/10/2020 10:02 0
3/10/2020 10:03 3
3/10/2020 10:04 1
3/10/2020 10:05 1
3/10/2020 10:06 1
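To make the rule concrete, the calculation I am describing works like this stateful scan (a plain-JavaScript sketch of the logic, not what I ran in Snowflake):

```javascript
// Flood rule: a flood opens at the first value >= 3 and closes at the
// next 0; a flood still open at the end of the data is counted through
// one minute past the last row (which is where the trailing 4 comes from).
// Times are one row per minute, so durations are just index differences.
function floodDurations(values) {
  const durations = [];
  let start = null; // index where the current flood opened
  values.forEach((v, i) => {
    if (v >= 3 && start === null) start = i; // flood opens
    if (v === 0 && start !== null) {         // flood closes at the 0
      durations.push(i - start);
      start = null;
    }
  });
  if (start !== null) durations.push(values.length - start); // edge case
  return durations;
}

// The Value column from the table above, 9:50 through 10:06:
const values = [1, 3, 1, 2, 0, 0, 1, 3, 2, 3, 2, 2, 0, 3, 1, 1, 1];
const durations = floodDurations(values);           // [3, 5, 4]
const total = durations.reduce((a, b) => a + b, 0); // 12
```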

So, in short, I am expecting the output below:

   DateTime        Value   diff_minutes
3/10/2020 9:51     3       3
3/10/2020 9:57     3       5
3/10/2020 10:03    3       4

I tried the SQL below, but it does not give the correct output; it fails on the second flood (because there is another value of 3 before the next 0):

select t.*,
       (case when value >= 3
             then datediff(minute,
                           datetime,
                           min(case when value = 0 then datetime end) over (order by datetime desc)
                          )
        end) as diff_minutes
from t;
What is it you are looking for exactly? Are you struggling with a SQL statement that will accomplish this? If so, what SQL have you tried? You might also want to add a generic SQL tag to this, since I don't think the solution will be Snowflake-specific. (Mike Walton)

3 Answers

2
votes

I'm not the proudest of this code, but it works and gives a starting place. I'm sure it can be cleaned up or simplified, and I haven't assessed its performance on larger tables.

The key insight I used is that if you add the diff back to the datetime, you can find rows where the results are equal, meaning that both rows are counting toward the same "0" record. Hopefully this concept is helpful if nothing else.

Also, the first CTE is a slightly hacky way to get the 4 at the end of your results.
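That collision can be seen in a tiny plain-JavaScript sketch (illustrative minute values, not part of the query):

```javascript
// Two high values at minutes 7 and 9 both "count to" the 0 at minute 12.
const highs = [7, 9];    // times of rows with value >= 3
const nextZero = 12;     // time of the next 0 row, shared by both
const diffs = highs.map(t => nextZero - t);         // [5, 3]
// Adding each diff back to its own time collides on the shared 0 record:
const resolved = highs.map((t, i) => t + diffs[i]); // [12, 12]
// Keep only the first high of each collision group, as the lag() check does:
const kept = diffs.filter((d, i) => i === 0 || resolved[i] !== resolved[i - 1]);
// kept === [5]: only the flood's opening row reports a duration
```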

--Add a fake zero at the end of the table to provide a value for
-- comparing high values that have not been resolved
-- added a flag so this fake value can be removed later
with fakezero as
(
SELECT datetime, value, 1 flag
FROM test

UNION ALL

SELECT dateadd(minute, 1, max(datetime)) datetime, 0 value, 0 flag
FROM test  
)

-- Find date diffs between high values and subsequent low values
,diffs as (
select t.*,
       (case when value >= 3
             then datediff(minute,
                           datetime,
                           min(case when value = 0 then datetime end) over (order by datetime desc)
                          )
        end) as diff_minutes
from fakezero t
)

--Fix cases where two high values are "resolved" by the same low value,
-- i.e. where adding diff_minutes to the datetime yields the same timestamp.
-- When that happens, the later high value belongs to a flood that has
-- already started, so its diff is nulled out.
select
  datetime
  ,value
  ,case when 
      lag(dateadd(minute, diff_minutes, datetime)) over(partition by value order by datetime)
      = dateadd(minute, diff_minutes, datetime)
    then null 
    else diff_minutes 
  end as diff_minutes
from diffs
where flag = 1
order by datetime;
1
vote
WITH data as (
  select time::timestamp as time, value from values
    ('2020-03-10 9:50', 1 ),
    ('2020-03-10 9:51', 3 ),
    ('2020-03-10 9:52', 1 ),
    ('2020-03-10 9:53', 2 ),
    ('2020-03-10 9:54', 0 ),
    ('2020-03-10 9:55', 0 ),
    ('2020-03-10 9:56', 1 ),
    ('2020-03-10 9:57', 3 ),
    ('2020-03-10 9:58', 2 ),
    ('2020-03-10 9:59', 3 ),
    ('2020-03-10 10:00', 2 ),
    ('2020-03-10 10:01', 2 ),
    ('2020-03-10 10:02', 0 ),
    ('2020-03-10 10:03', 3 ),
    ('2020-03-10 10:04', 1 ),
    ('2020-03-10 10:05', 1 ),
    ('2020-03-10 10:06', 1 )
     s( time, value)
) 
select 
    a.time
    ,a.value
    ,min(trig_time)over(partition by reset_time_group order by time) as first_trigger_time
    ,iff(a.time=first_trigger_time, datediff('minute', first_trigger_time, reset_time_group), null) as trig_duration
from (
select d.time
   ,d.value 
   ,iff(d.value>=3,d.time,null) as trig_time
   ,iff(d.value=0,d.time,null) as reset_time
   ,max(time)over(order by time ROWS BETWEEN 1 PRECEDING AND UNBOUNDED FOLLOWING) as max_time
   ,coalesce(lead(reset_time)ignore nulls over(order by d.time), max_time) as lead_reset_time
   ,coalesce(reset_time,lead_reset_time) as reset_time_group
from data as d
) as a
order by time;

This gives the results you seem to expect/describe:

TIME                     VALUE  FIRST_TRIGGER_TIME         TRIG_DURATION
2020-03-10 09:50:00.000    1        
2020-03-10 09:51:00.000    3    2020-03-10 09:51:00.000    3
2020-03-10 09:52:00.000    1    2020-03-10 09:51:00.000    
2020-03-10 09:53:00.000    2    2020-03-10 09:51:00.000    
2020-03-10 09:54:00.000    0    2020-03-10 09:51:00.000    
2020-03-10 09:55:00.000    0        
2020-03-10 09:56:00.000    1        
2020-03-10 09:57:00.000    3    2020-03-10 09:57:00.000    5
2020-03-10 09:58:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 09:59:00.000    3    2020-03-10 09:57:00.000    
2020-03-10 10:00:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 10:01:00.000    2    2020-03-10 09:57:00.000    
2020-03-10 10:02:00.000    0    2020-03-10 09:57:00.000    
2020-03-10 10:03:00.000    3    2020-03-10 10:03:00.000    3
2020-03-10 10:04:00.000    1    2020-03-10 10:03:00.000    
2020-03-10 10:05:00.000    1    2020-03-10 10:03:00.000    
2020-03-10 10:06:00.000    1    2020-03-10 10:03:00.000    

So, how it works: we find the trigger times and the reset times, then work out max_time to handle the last-row edge case. After that we find the next reset_time looking forward, falling back to max_time when there is none, and then select the current row's reset time or the looked-ahead lead_reset_time. For the work you are doing here that last step could be skipped, since your data cannot trigger and reset on the same row; and because the math is done on the trigger row, it doesn't matter which group the reset row belongs to.

Then we break into a new SELECT layer, as we have reached Snowflake's limit on nested/interrelated expressions, and take the MIN across the reset_time_group to find the first trigger time, which we then compare to the row's time and run a DATEDIFF on.
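That grouping step can be mimicked row by row in plain JavaScript (an illustrative sketch of the logic, with times as minutes since 9:50; it reproduces the 3/5/3 durations in the output above, before the one-minute adjustment):

```javascript
// rows: [time, value], one row per minute starting at 9:50.
const rows = [[0,1],[1,3],[2,1],[3,2],[4,0],[5,0],[6,1],[7,3],[8,2],[9,3],
              [10,2],[11,2],[12,0],[13,3],[14,1],[15,1],[16,1]];

// lead(reset_time) ignore nulls: the next 0 strictly after each row,
// falling back to max_time (the last row's time) when none exists;
// a 0 row keeps its own time, as in coalesce(reset_time, lead_reset_time).
const maxTime = rows[rows.length - 1][0];
const resetGroup = rows.map(([t, v], i) => {
  if (v === 0) return t;
  const next = rows.slice(i + 1).find(([, nv]) => nv === 0);
  return next ? next[0] : maxTime;
});

// min(trig_time) over (partition by reset_time_group): the first trigger
// in each group gets the duration, later triggers in the group get nothing.
const durations = {};
rows.forEach(([t, v], i) => {
  const g = resetGroup[i];
  if (v >= 3 && !(g in durations)) durations[g] = g - t;
});
// durations: { 4: 3, 12: 5, 16: 3 }
```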

As a side note, DATEDIFF is a little naive in its math: '2020-01-01 23:59:59' and '2020-01-02 00:00:01' are 2 seconds apart, but DATEDIFF reports them as 1 minute apart, 1 hour apart, and 1 day apart, because the function truncates both timestamps to the selected unit and then differences those results.
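The truncate-then-difference behavior can be imitated directly (a plain-JavaScript sketch of the minute-level case only):

```javascript
// Imitate DATEDIFF('minute', a, b): truncate both timestamps to whole
// minutes since the epoch, then subtract the truncated values.
const a = new Date('2020-01-01T23:59:59Z');
const b = new Date('2020-01-02T00:00:01Z');

const truncMinutes = (d) => Math.floor(d.getTime() / 60000);
const naiveMinuteDiff = truncMinutes(b) - truncMinutes(a); // 1, like DATEDIFF
const trueSecondsApart = (b - a) / 1000;                   // 2 seconds apart
```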

To get the final flood to have the value 4, as asked for in the question, alter the lead_reset_time line to:

,coalesce(lead(reset_time)ignore nulls over(order by d.time), dateadd('minute', 1, max_time)) as lead_reset_time

This moves max_time forward by one minute, assuming (in the absence of future data) that the existing row state at 10:06 remains valid for one more minute. That is not how I would do it, but it is the code you asked for.

1
vote

The JavaScript UDF (table function) version:

select d, v, iff(3<=v and 1=row_number() over (partition by N order by d),
    count(*) over (partition by N), null) trig_duration
from t, lateral flood_count(t.v::float) 
order by d;

Where flood_count() is defined as:

create or replace function flood_count(V float) 
returns table (N float)
language javascript AS
$${

  initialize: function() { 
    this.n = 0 
    this.flood = false
  },

  processRow: function(row, rowWriter) { 
    if (3<=row.V && !this.flood) {
        this.flood = true
        this.n++
    }
    else if (0==row.V) this.flood=false
    rowWriter.writeRow({ N: this.flood ? this.n : null })  
  },

}$$;

Assuming this input:

create or replace table t as
select to_timestamp(d, 'mm/dd/yyyy hh:mi') d, v 
from values
    ('3/10/2020 9:50',  1),
    ('3/10/2020 9:51',  3),
    ('3/10/2020 9:52',  1),
    ('3/10/2020 9:53',  2),
    ('3/10/2020 9:54',  0),
    ('3/10/2020 9:55',  0),
    ('3/10/2020 9:56',  1),
    ('3/10/2020 9:57',  3),
    ('3/10/2020 9:58',  2),
    ('3/10/2020 9:59',  3),
    ('3/10/2020 10:00', 2),
    ('3/10/2020 10:01', 2),
    ('3/10/2020 10:02', 0),
    ('3/10/2020 10:03', 3),
    ('3/10/2020 10:04', 1),
    ('3/10/2020 10:05', 1),
    ('3/10/2020 10:06', 1)
    t(d,v)
;
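The UDF's row-by-row state machine can be checked outside Snowflake with the same logic in plain JavaScript (a sketch; N labels each flood, null marks rows outside a flood, and counting rows per N reproduces the 3/5/4 durations):

```javascript
// Replicates processRow(): open a flood at value >= 3, close it at 0,
// and label every in-flood row with the flood number N.
function floodLabels(values) {
  let n = 0, flood = false;
  return values.map(v => {
    if (v >= 3 && !flood) { flood = true; n++; }
    else if (v === 0) flood = false;
    return flood ? n : null;
  });
}

// The v column from the table above, 9:50 through 10:06:
const labels = floodLabels([1,3,1,2,0,0,1,3,2,3,2,2,0,3,1,1,1]);

// count(*) over (partition by N), done by hand:
const counts = {};
labels.forEach(l => { if (l !== null) counts[l] = (counts[l] || 0) + 1; });
// counts: { 1: 3, 2: 5, 3: 4 }
```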