1
votes

I have a large spreadsheet with headers (180k+ rows) with unique IDs in A, start date in B, and end date in C. There are multiple rows per ID and the start and end dates overlap.

I need to find any gaps in the date ranges for each ID. I've written a few different formulas and macros, tried and tweaked VBA scripts I've found. I've attempted a power query and power pivot grasping at straws, but if Excel doesn't crash I'm not getting a usable output.

Example data:

ID start end
100 1/1/2015 3/1/2015
100 3/1/2015 1/1/2300
100 1/1/2018 1/1/2019
096 7/1/2020 1/1/2021
182 9/17/2017 1/1/2018
182 1/1/2018 1/1/2019
607 1/1/2015 9/1/2015
607 9/1/2015 1/1/2017
607 1/1/2018 1/1/2020
607 1/1/2021 1/1/2300

I would like to combine or consolidate these to remove extra lines for the IDs that do not have any gaps in the date range, but will leave an extra row for the IDs that do:

ID start end
100 1/1/2015 1/1/2300
096 7/1/2020 1/1/2021
182 9/17/2017 1/1/2019
607 1/1/2015 1/1/2017
607 1/1/2018 1/1/2020
607 1/1/2021 1/1/2300

I don't need it to combine; though, for presentations sake it would be nice. Also, I would settle for something that is able to tell me which IDs have a gap in the range, even if it doesn't combine the dates or remove extra rows.

I did find a script from another site that almost did the job, though because the date ranges can't all be sorted in proper order, like ID 100 in the example, it creates an extra line when it shouldn't.

Sub Consolidate_Dates()
    
    Dim cell As Range
    Dim Nextrow As Long
    Dim Startdate As Date
    
    Nextrow = Range("A" & Rows.Count).End(xlUp).Row + 2
    Startdate = Range("B2").Value
    
    Application.ScreenUpdating = False
    For Each cell In Range("A2", Range("A2").End(xlDown))
        If cell.Value <> cell.Offset(1).Value Or _
           cell.Offset(0, 2).Value < cell.Offset(1, 1).Value - 1 Then
            Range("A" & Nextrow).Resize(1, 3).Value = cell.Resize(1, 3).Value
            Range("B" & Nextrow).Value = Startdate
            Nextrow = Nextrow + 1
            Startdate = cell.Offset(1, 1).Value
        End If
    Next cell
    Application.ScreenUpdating = True
End sub
3
I’m not understanding what you’re trying to do. Can you show some desired output data and explain further?Mark S
Do you want to combine two rows where the start of one immediately follows the end of the other eg 1/1/2021 to 31/1/2021 and 1/2/2021 to 28/2/2021 giving 1/1/2021 to 28/2/2021CDP1802

3 Answers

0
votes

Try this. Make sure that the data range is sorted by id and start date before you begin.

Option Explicit

Public Enum ColId
    ColId_Id = 1
    ColId_Start_Date
    ColId_End_Date
End Enum

Public Sub Test()

    Dim row As Integer
    
    ' Skip the header row & the first data row. 
    ' Start on the second data row.
    row = 3
    
    With Worksheets("Sheet1")
        
        ' Loop until you run out of data
        Do While .Cells(row, ColId_Id) <> ""
            
            ' Compare the current row to the previous row.
            ' We're looking for the same id value and a start date that is 
            ' within or adjoins the previous row's date range.
            If .Cells(row, ColId_Id).Value = .Cells(row - 1, ColId_Id).Value _
            And .Cells(row, ColId_Start_Date).Value >= .Cells(row - 1, ColId_Start_Date).Value _
            And .Cells(row, ColId_Start_Date).Value <= .Cells(row - 1, ColId_End_Date).Value _
            And .Cells(row, ColId_End_Date).Value > .Cells(row - 1, ColId_End_Date).Value _
            Then

                ' Update the previous row and delete the current row.
                .Cells(row - 1, ColId_End_Date).Value = .Cells(row, ColId_End_Date).Value
                .Rows(row).Delete

            Else

                ' Next row.
                row = row + 1

            End If
        
        Loop
    
    End With

End Sub
0
votes

Here is a Power Query solution:

Please read the comments in the code and explore the applied steps window to understand the algorithm better, but:

  • create a List of the included dates in each range for each ID
    • combine them into a single list
  • create a List of ALL possible dates from the earliest date to the latest date for each ID
  • If all the dates in the "ALL" range are included in the combined list, then we have no gaps.
  • Create two separate tables
    • one with a Group for the no gap list
    • a second for the list with gaps which we then expand
  • Append the two tables.

note that many steps cannot be done from the UI

M Code

Paste into the Advanced Editor

ensure you change the table name in Line2 to your actual table name

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"ID", Int64.Type}, {"start", type date}, {"end", type date}}),

//Turn each date range into a list
    #"Added Custom" = Table.AddColumn(#"Changed Type", "dateList", each 
        List.Dates([start],
                    Duration.Days([end]-[start])+1,
                    #duration(1,0,0,0))),

  //Group the rows by ID
  /*Generate columns where 
      actual date ranges are combined into a list,
      and a list of the Full date range for that ID*/
    #"Grouped Rows" = Table.Group(#"Added Custom", {"ID"}, 
        {{"All", each _, type table [ID=nullable number, start=nullable date, end=nullable date, dateList=list]},
        {"combinedDates", each List.Distinct(List.Combine([dateList]))},
        {"startToEnd", each List.Dates(List.Min([start]),
                                Duration.Days(List.Max([end])-List.Min([start]))+1,
                                #duration(1,0,0,0))}        
        }),

  //if the full list and the combined list Match, then there are no gaps and return True else False        
    #"Added Custom1" = Table.AddColumn(#"Grouped Rows", 
          "Custom", each List.IsEmpty(List.Difference([startToEnd],[combinedDates]))),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", 
          "start", each if [Custom] = false then null
                else List.Min([combinedDates])),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", 
          "end", each if [Custom] = false then null 
                else List.Max([combinedDates])),

  //create the table of Trues which we will NOT expand
  trueTbl = Table.SelectRows(#"Added Custom3", each [Custom] = true),
    trueRemoveColumns = Table.RemoveColumns(trueTbl,
            {"All", "combinedDates", "startToEnd","Custom"}),
    trueTyped = Table.TransformColumnTypes(trueRemoveColumns,
            {{"start", type date}, {"end", type date}}),

   //create the table of False which we WILL expand 
  falseTbl = Table.SelectRows(#"Added Custom3", each [Custom] = false),
    expandFalse = Table.ExpandTableColumn(falseTbl, "All", 
            {"start", "end"}, {"start.1", "end.1"}),
    falseRemoveColumns = Table.RemoveColumns(expandFalse,
            {"combinedDates", "startToEnd", "Custom", "start", "end"}),
    falseRenameColumns = Table.RenameColumns(falseRemoveColumns,
            {{"start.1", "start"}, {"end.1", "end"}}),

//Combine the tables
    comb = Table.Combine({trueTyped, falseRenameColumns})
in 
   comb

enter image description here

0
votes

This uses an object orientated approach. It first adds a collection of ID objects into a dictionary, one object for each unique ID. To each ID object it adds the collection of date spans that ID has. As each span is added the start data is compared with the previous end date to decide if there is a gap or not. The data must be sorted by ID, Start Date

Put the input data on sheet1, the output goes to sheet2. It shows the gaps in columns D and E. A script to create test data is also shown.

Option Explicit

Sub Consolidate_Dates()
    
    Const SHT_DATA = "Sheet1"
    Const SHT_OUTPUT = "Sheet2"

    Dim wb As Workbook, ws As Worksheet
    Dim iLastRow As Long, i As Long, n As Integer
    Dim dict As Object, id As String, objID As clsID
    Dim t0 As Single, ar As Variant, obj As Variant
    t0 = Timer
   
    Set dict = CreateObject("Scripting.Dictionary")
  
    ' scan data on sheet 1
    Set wb = ThisWorkbook
    Set ws = wb.Sheets(SHT_DATA)
    iLastRow = ws.Cells(rows.count, "A").End(xlUp).row
    ar = ws.Range("A2:C" & iLastRow).Value2 ' put data in array
    For i = 1 To UBound(ar)
        id = Trim(ar(i, 1))
        If Not dict.exists(id) Then
            Set objID = New clsID
            objID.id = id
            dict.Add id, objID
        End If
        dict(id).AddSpan CDate(ar(i, 2)), CDate(ar(i, 3))
    Next

    ' results sheet
    With wb.Sheets(SHT_OUTPUT)
        .Cells.Clear
        .Range("A1:E1") = Array("ID", "Start", "End", "Gap Start", "Gap End")
        .Columns("B:E").NumberFormat = "mm/dd/yyyy"
    End With
    ReDim ar(1 To iLastRow, 1 To 5) ' reuse part of array for output
    i = 1
    For Each obj In dict.items
        Set objID = obj
        ' output spans and gaps
        For n = 1 To obj.spansOut.count
            ar(i, 1) = objID.id
            ar(i, 2) = objID.spansOut(n).StartDate
            ar(i, 3) = objID.spansOut(n).EndDate
            ' show gaps
            If n > 1 Then
                ar(i - 1, 4) = objID.spansOut(n - 1).EndDate
                ar(i - 1, 5) = objID.spansOut(n).StartDate
            End If
            i = i + 1
        Next
    Next
    
    ' finish
    Set dict = Nothing
    With wb.Sheets(SHT_OUTPUT)
        .Range("A2:E" & i).Value2 = ar
        .Columns("A:E").AutoFit
        .Activate
        .Range("A1").Select
    End With
    Erase ar
   
    MsgBox Format(i - 1, "#,###") & " rows output to " & SHT_OUTPUT, vbInformation, Int(Timer - t0) & " seconds"
End Sub

A Class module named clsID

Option Explicit

Public id As String ' unique id
Public hasGaps As Boolean
Public spans As New Collection
Public spansOut As New Collection

Sub AddSpan(dtStart As Date, dtEnd As Date)
    
    Dim spNew As New clsSpan, spLast As clsSpan
    spNew.StartDate = dtStart
    spNew.EndDate = dtEnd
    spans.Add spNew, CStr(spans.count + 1)

    If spansOut.count = 0 Then
        spansOut.Add spNew, "1"
        hasGaps = False
    Else
        Set spLast = spansOut(spansOut.count)
        If spNew.StartDate < spLast.StartDate Then
            MsgBox "Start dates not sorted correctly for " & id, vbCritical
        ElseIf spNew.StartDate > spLast.EndDate Then
            ' add new span
            spansOut.Add spNew, CStr(spansOut.count + 1)
            hasGaps = True
        ElseIf spNew.EndDate > spLast.EndDate Then
            ' extend last span
            spLast.EndDate = spNew.EndDate
        Else
            ' no change
        End If
    End If
End Sub

A Class module named clsSpan

Option Explicit

Public StartDate As Date
Public EndDate As Date

Script to generate random test data

Sub testdata()
    Const ROW_COUNT = 200000
    Dim dt1 As Date, i As Long
    Sheet1.Cells.Clear
    For i = 2 To ROW_COUNT + 1
        Sheet1.Cells(i, 1) = 1000 + Int(9000 * Rnd)
        dt1 = CDate("1/1/2000") + Int(3650 * Rnd)
        Sheet1.Cells(i, 2) = dt1
        Sheet1.Cells(i, 3) = dt1 + Int(1000 * Rnd)
    Next

    With Sheet1.Sort
        .SortFields.Clear
        .SortFields.Add key:=Range("A1:A" & i)
        .SortFields.Add key:=Range("B1:B" & i)
        .SetRange Range("A1:C" & i)
        .Header = xlYes
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
    Sheet1.Activate
    Sheet1.Range("A" & ROW_COUNT + 1).Select
   
    MsgBox Format(ROW_COUNT, "#,###") & " rows created and sorted"
End Sub