1
votes

I have table like the following:

person Session session_start session_end half_hour_start half_hour_end
A A001 9/13/2020 7:58:00 PM 9/13/2020 8:10:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM
A A002 9/13/2020 8:02:00 PM 9/13/2020 8:13:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM
A A003 9/13/2020 8:27:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM
B B001 9/13/2020 8:20:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM
B B002 9/13/2020 8:28:00 PM 9/13/2020 8:43:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM

The goal is to count distinct minutes per every person for all session within 30 min block (half_hour_start - half_hour_end). Count starts from minute 00, ends by minute 29 (so there're 30 distinct minutes in total).

So that even in case if a person had session starting at 9/13/2020 8:00:01 PM and ending at 9/13/2020 8:00:05 PM, this person will still get credit for 1 minute - minute '00'. We're interested not in count of full minutes, but in count of all distinct minutes where session had place, even partially.

I need to get results like:

---old version---

person distinct_minutes_count
A 14
B 10

---new version---

person distinct_minutes_count
A 16
B 10

(which could be coming from:

person Session session_start session_end half_hour_start half_hour_end distinct_minutes_count_per_person
A A001 9/13/2020 7:58:00 PM 9/13/2020 8:10:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 16
A A002 9/13/2020 8:02:00 PM 9/13/2020 8:13:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 16
A A003 9/13/2020 8:27:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 16
B B001 9/13/2020 8:20:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 10
B B002 9/13/2020 8:28:00 PM 9/13/2020 8:43:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 10

)

The intermediate steps needed, probably, are:

person Session session_start session_end half_hour_start half_hour_end distinct_minute_per_session distinct_minutes_count_per_session distinct_minute_per_person distinct_minutes_count
A A001 9/13/2020 7:58:00 PM 9/13/2020 8:10:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 00,01,02,03,04,05,06,07,08,09,10 11 00,01,02,03,04,05,06,07,08,09,10,11,12,13,27,28 16
A A002 9/13/2020 8:02:00 PM 9/13/2020 8:13:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 02,03,04,05,06,07,08,09,10,11,12,13 12 00,01,02,03,04,05,06,07,08,09,10,11,12,13,27,28 16
A A003 9/13/2020 8:27:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 27,28 2 00,01,02,03,04,05,06,07,08,09,10,11,12,13,27,28 16
B B001 9/13/2020 8:20:00 PM 9/13/2020 8:28:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 20,21,22,23,24,25,26,27,28 9 20,21,22,23,24,25,26,27,28,29 10
B B002 9/13/2020 8:28:00 PM 9/13/2020 8:43:00 PM 9/13/2020 8:00:00 PM 9/13/2020 8:30:00 PM 28,29 2 20,21,22,23,24,25,26,27,28,29 10

But I don't see options of creating list values for a column in Snowflake.

3
your image are only empty grid ..scaisEdge
my apologies - fixeduser14276110
Hi - shouldn't the distinct_minutes for A be 13, not 14? 10 for A001 and then 3 for the non-overlapping part of A002? I'm also not clear why you are keeping the Session in your resultset as the minutes apply to the user not the session i.e. there aren't 14 (or, rather, 13) distinct minutes for session A001 - there are either 10 (if A001 has priority over A002) or 2 (if A002 has priority over A001). Please could you explain in more detail what you are doing here?NickW
Hi @NickW! thank you for commenting. The count of minutes is correct - we assign the following indexes - minute 00 [1], minute 01 [2], etc, so that if a person started with minute 00 and ended at minute 10, (s)he had 00-10 minutes, 11 in total. that's why we cut our 30 min block at minute 29, not 30. I added intermediate count of distinct minutes for sessions for convenience. No session has priority over another - we're interested in total results per person, not per session.user14276110

3 Answers

0
votes

To generate multiple rows, you can use a JS table UDF:

CREATE OR REPLACE FUNCTION generate_minutes(STARTING timestamp, ENDING timestamp)
RETURNS TABLE (V VARCHAR)
LANGUAGE JAVASCRIPT
AS '{
    processRow: function get_params(row, rowWriter, context){
       for(var i=row.STARTING/60/1000; i<=row.ENDING/60/1000; i++) {
           rowWriter.writeRow({V: i}); 
       }
    }
}';


with data as (
select 'A' person, 'A001' session,  '9/13/2020 7:58:00'::timestamp session_start,   '9/13/2020 8:10:00'::timestamp session_end, '9/13/2020 8:00:00'::timestamp half_hour_start, '9/13/2020 8:30:00'::timestamp half_hour_end 
union all select 'A', 'A002', '9/13/2020 8:02:00', '9/13/2020 8:13:00', '9/13/2020 8:00:00', '9/13/2020 8:30:00'
union all select 'A', 'A003', '9/13/2020 8:27:00', '9/13/2020 8:28:00', '9/13/2020 8:00:00', '9/13/2020 8:30:00'
)

select person, count(distinct x.v) distinct_minutes
from data
  , table(generate_minutes(
      greatest(session_start, half_hour_start)
      , least(session_end, timestampadd(minute, -1, half_hour_end)))
     ) x
where session_start<=half_hour_end and session_end >= half_hour_start
group by person

enter image description here

0
votes

the following should get you most of the way there. You can probably drop a lot of the columns in the CTEs - I just had them in there for checking the intermediary steps. There are probably lots of other scenarios that are not in your sample data so I didn't check but you probably should e.g. if a session session_start/session_end is entirely within the duration of another session

WITH TEMP1 AS
    (
        SELECT
            PERSON, SESSION, SESSION_START, SESSION_END, HALF_HOUR_START, HALF_HOUR_END
          , GREATEST(SESSION_START,HALF_HOUR_START)                                                     CALC_START
          , LEAST(SESSION_END,HALF_HOUR_END)                                                            CALC_END
          , DATEDIFF(MINUTE, GREATEST(SESSION_START,HALF_HOUR_START), LEAST(SESSION_END,HALF_HOUR_END)) DUR -- Adjust to allow for partial minutes if necessary, by adding 1 to this value
          , RANK() OVER (PARTITION BY PERSON, HALF_HOUR_START ORDER BY SESSION_START ASC) RNK
        FROM
            SESSION_DATA
    )
  , TEMP2 AS
    (
        SELECT
            T1.PERSON, T1.SESSION, T1.DUR, T1.RNK, T1.CALC_START, T1.CALC_END
          , CASE
                WHEN T2.PERSON IS NULL -- Must be first session in the half-hour slot for the person
                    THEN T1.DUR
                    ELSE DATEDIFF(MINUTE, T2.CALC_END,T1.CALC_END)
            END DISTINCT_DUR 
        FROM
            TEMP1 T1
            LEFT OUTER JOIN -- join to previous session within the same half-hour slot for the same user
                TEMP1 T2
                ON
                    T1.PERSON              = T2.PERSON
                    AND T1.HALF_HOUR_START = T2.HALF_HOUR_START
                    AND T1.RNK             = (T2.RNK + 1)
    )
SELECT
    PERSON
  , SUM(DISTINCT_DUR)
FROM
    TEMP2
GROUP BY
    PERSON
;

EDIT AFTER COMMENT

As I said, this was a starting point to help you and I hadn't tested this with every possible combination of session data. If you make the following change to the CASE statement it will work when you include session A003:

,CASE 
  WHEN T2.PERSON IS NULL THEN T1.DUR
  WHEN T1.CALC_START > T2.CALC_END THEN T1.DUR
 ELSE DATEDIFF(MINUTE, T2.CALC_END,T1.CALC_END) END DISTINCT_DUR
0
votes

so using these two tables of periods of interest and session seperate to start with

with periods as (
    select p_from::timestamp_ntz as p_from
        ,p_to::timestamp_ntz as p_to
    from values 
    ('2020-09-13 20:00:00', '2020-09-13 20:30:00' ),
    ('2020-09-13 20:30:00', '2020-09-13 21:00:00' )
    v(p_from, p_to)
), sessions as (
    select user
        ,session
        ,s_from::timestamp_ntz as s_from
        ,s_to::timestamp_ntz as s_to
    from values 
    ('A','A001','2020-09-13 19:58:00','2020-09-13 20:10:00'),
    ('A','A002','2020-09-13 20:02:00','2020-09-13 20:13:00'),
    ('A','A003','2020-09-13 20:27:00','2020-09-13 20:28:00'),
    ('B','B001','2020-09-13 20:20:00','2020-09-13 20:28:00'),
    ('B','B002','2020-09-13 20:28:00','2020-09-13 20:43:00')
    v(user, session, s_from, s_to)

and then we mixed them together first to recreate what you had as p_s_mix and then triming the session times into the boundaries of the periods (aka is done with t_from and t_to) we can then get the date minutes parts of those (aka from_min & `to_min)

), p_s_mix as (
    select s.*, p.*
    from sessions as s
    join periods as p
    where s.s_from < p.p_to and s.s_to > p.p_from
    order by 1,2
), p_s_trim AS (
    select user
        ,session
        --,s_from
        --,s_to
        ,p_from
        ,p_to
        ,greatest(p_from, s_from) as t_from
        ,least(p_to, s_to) as t_to
        ,date_part('minute',t_from) as from_min
        ,date_part('minute',t_to) as to_min
    from p_s_mix

Then we can mix that with a gross table with 60 minute slots pre-coded, and just count the distinct minutes. Just for the fun the array_agg is provided to show the arrays in the form given in your example.

), gross AS (
    SELECT * from values
    (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),
    (10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
    (20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
    (30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
    (40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
    (50),(51),(52),(53),(54),(55),(56),(57),(58),(59)
    v(slot)
)
select p.user
    ,p.p_from
    ,p.p_to
    ,count(distinct g.slot) as c_mintues
    ,array_agg(distinct g.slot) within group(order by g.slot) as distinct_mintues
 from p_s_trim as p
 join gross as g
    on p.from_min <= g.slot and p.to_min >= g.slot
 group by 1,2,3
 order by 1,2,3
;

Giving:

USER  P_FROM                   P_TO                     C_MINTUES  DISTINCT_MINTUES 
A     2020-09-13 20:00:00.000  2020-09-13 20:30:00.000  16         [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 27, 28]
B     2020-09-13 20:00:00.000  2020-09-13 20:30:00.000  11         [ 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
B     2020-09-13 20:30:00.000  2020-09-13 21:00:00.000  14         [ 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]