0
votes

Working in Standard SQL/BQ, but trying to figure out a way to get the average difference between two timestamps for each user for consecutive dates?

My data looks like:

    UserId  Login                      
    0       2016-07-14 14:43:30 +0000   
    2       2018-07-09 17:19:55 +0000   
    2       2018-07-10 20:38:37 +0000   
    2       2014-07-12 17:37:20 +0000
    2       2018-07-13 01:08:20 +0000   
    3       2015-12-04 17:17:41 +0000   
    3       2015-12-04 20:14:43 +0000   

So at this point I have:

  SELECT UserId
        ,CASE WHEN(
          (Magic)
          THEN AVG(TIMESTAMP_DIFF(Login2, Login, HOUR) ELSE NULL END))) AS hours
  FROM TABLE
  GROUP BY 1

The (Magic) portion is the stuck portion where I'd like to specify if the dates for each user are consecutive, then take the difference of the latest login date and subtract it by the current login date. Could I apply a lag and say if the difference between the next date and the first date is within a day then that's consecutive? Or is there a partition/window function approach to this?

Login2 would be the second login timestamp of the following consecutive date.

Results that I'd like:

UserId      Average Hours between Consecutive Logins                      
    0       NULL 
    2       17.5 hours
    3       3 hours
  • For UserId 0, it's NULL because there's no second login time to compare to.
  • For UserId 2, it's getting the average between the difference in the two consecutive login dates (07/09 - 07/10 & 07/12 - 07/13) (27 hours + 8 hours / 2) = 5.5 hours
  • For UserId 3, one data point with a same day login with 3 hours difference so it fits the consecutive condition, and got the average difference of the one data point
1
. . Please edit your question and show the results you want to get. Your question mentions a difference between dates, but your rows have multiple dates. It is hard to figure out what you want to do.Gordon Linoff
Thanks for catching that, editing right nowS31
@GordonLinoff Edited. Should've caught that, changes my question - but the ideal results are there.S31
. . I still don't get it. User 2 has 2 logins on consecutive dates. These are about 27 hours apart. I don't see where 5.5 comes from.Gordon Linoff
@GordonLinoff apologies! fixed!S31

1 Answers

2
votes

Below example for BigQuery Standard SQL

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 0 UserId, TIMESTAMP '2016-07-14 14:43:30' Login UNION ALL   
  SELECT 2, '2018-07-09 17:19:55' UNION ALL   
  SELECT 2, '2018-07-10 20:38:37' UNION ALL   
  SELECT 2, '2018-07-12 17:37:20' UNION ALL
  SELECT 2, '2018-07-13 01:08:20' UNION ALL   
  SELECT 3, '2015-12-04 17:17:41' UNION ALL   
  SELECT 3, '2015-12-04 20:14:43' 
)
SELECT
  UserId,
  ROUND(AVG(IF(
      TIMESTAMP_DIFF(TIMESTAMP_TRUNC(next_Login, DAY), TIMESTAMP_TRUNC(Login, DAY), DAY) < 2,
      TIMESTAMP_DIFF(next_Login, Login, MINUTE), NULL
  ))/60, 2) average_hours
FROM (
  SELECT UserId, Login, 
    LEAD(Login) OVER(PARTITION BY UserId ORDER BY Login) next_Login
  FROM `project.dataset.table`
)
GROUP BY UserId
ORDER BY UserId   

with result as

Row     UserId      average_hours    
1       0           null     
2       2           17.41    
3       3           2.95          

Hope you will be able to adjust above to your specifics if any