Group records by consecutive dates when dates are not exactly consecutive

Question

I have some data that contains dates. I'm trying to group the data by consecutive dates, however, the dates are not exactly consecutive. Here is an example:

DateColumn              | Value
------------------------+-------
2017-01-18 01:12:34.107 | 215426 <- batch no. 1
2017-01-18 01:12:34.113 | 215636
2017-01-18 01:12:34.623 | 123516
2017-01-18 01:12:34.633 | 289926
2017-01-18 04:58:42.660 | 259063 <- batch no. 2
2017-01-18 04:58:42.663 | 261830
2017-01-18 04:58:42.893 | 219835
2017-01-18 04:58:42.907 | 250165
2017-01-18 05:18:14.660 | 134253 <- batch no. 3
2017-01-18 05:18:14.663 | 134257
2017-01-18 05:18:14.667 | 134372
2017-01-18 05:18:15.040 | 181679
2017-01-18 05:18:15.043 | 226368
2017-01-18 05:18:15.043 | 227070

The data is generated in batches and each row inside a batch takes a few milliseconds to generate. I'm trying to group the results as follows:

Date1                   | Date2                   | Count
------------------------+-------------------------+------
2017-01-18 01:12:34.107 | 2017-01-18 01:12:34.633 | 4
2017-01-18 04:58:42.660 | 2017-01-18 04:58:42.907 | 4
2017-01-18 05:18:14.660 | 2017-01-18 05:18:15.043 | 6

It is safe to assume that if two consecutive rows are more than 1 minute apart then they belong to a different batch.

I tried solutions involving ROW_NUMBER function but they work with consecutive dates (date difference between two rows is fixed). How can I achieve desired result when the difference is fuzzy?

Please note that a batch could be much longer than a minute. For example a batch might consist of rows starting from 2017-01-01 00:00:00 and ending at 2017-01-01 00:05:00 consisting of ~3000 rows and each row few dozen or hundred millisecond apart. What is for certain is that batches are at least 1 minute apart.

re the "is it safe..." we can't say - the business or other domain experts will be the only ones who can say. If in batches you need an identifier for each batch and use that — mmmmmm
do the last two rows in question have the same datetime value or is it a typo? — Vamsi Prabhala
@vkp this is weird but not a typo. Perhaps two rows were inserted within 1 millisecond or the actual time got rounded to nearest datetime value. — Salman A

Gurwinder Singh Gurwinder Singh · Accepted Answer · 2017-01-18T13:59:28

Try this:

select min(t.dateColumn) date1, max(t.dateColumn) date2, count(*)
from (
    select t.*, sum(val) over (
            order by t.dateColumn
            ) grp
    from (
        select t.*, case 
                when datediff(ms, lag(t.dateColumn, 1, t.dateColumn) over (
                            order by t.dateColumn
                            ), t.dateColumn) > 60000
                    then 1
                else 0
                end val
        from your_table t
        ) t
    ) t
group by grp;

Produces:

uses the analytic function lag() to mark starting of next batch based on the difference of datecolumn from the last one and then use analytic sum() on it to create group of batches and then group by it to find required aggregates.

There may be some misclassification in groups due to rounding issues with DATETIME. From MSDN,

datetime values are rounded to increments of .000, .003, or .007 seconds, as shown in the following table.

Here is the same query rewritten using CTEs:

WITH cte1(DateColumn, ValueColumn) AS (
    -- Insert your query that returns a datetime column and any other column
    SELECT
        SomeDate,
        SomeValue
    FROM SomeTable
    WHERE SomeColumn IS NOT NULL
), cte2 AS (
    -- This query adds a column called "val" that contains
    -- 1 when current row date - previous row date > 1 minute
    -- 0 otherwise
    SELECT
        cte1.*,
        CASE WHEN DATEDIFF(MS, LAG(DateColumn, 1, DateColumn) OVER (ORDER BY DateColumn), DateColumn) > 60000 THEN 1 ELSE 0 END AS val
    FROM cte1
), cte3 AS (
    -- This query adds a column called "grp" that numbers 
    -- the groups using running sum over the "val" column
    SELECT
        cte2.*,
        SUM(val) OVER (ORDER BY DateColumn) AS grp
    FROM cte2
)
SELECT
    MIN(DateColumn) Date1,
    MAX(DateColumn) Date2,
    COUNT(ValueColumn) [Count]
FROM cte3
GROUP BY grp

Group records by consecutive dates when dates are not exactly consecutive

3 Answers