1
votes

I have data and I need to split each block so as to store each block in separate row. The entire text looks like:

م
مطروح
الحمام
school
الصف
:
الصف الأول
 1
 458316219 
 30709101600371 
ابراهيم وليد ابراهيم ابوالحمد
منافذ فورى
 2
 458361688 
 30702263300318 
احمد ابوالريش فرج عبدالله
منافذ فورى
 3
 458312720 
 30703143300418 
اسلام فتحى محمد ناجى
منافذ فورى
 4
 458790904 
 30606101802299 
اسلام نصار حسين نصار حسين عبد الونيس
منافذ فورى
 5
 458312908 
 30612013300259 
ايمن راضى صالح سلومه
منافذ فورى
 6
 458884564 
 30802203300186 
بسمه محمد ابراهيم ظدم
منافذ فورى
 7
 477625786 
 30708263300235 
بشار نصر الله مصوف السايب
منافذ فورى

I used https://regex101.com/ and I could define the start of each block like that

\d{1,3}\n

This highlights the start of each block

How can I split and separate each block >> and each block has to be in one row?

Here's the HTML for the whole page: https://pastebin.com/nu0dLvch

Here's a link of the full data: https://pastebin.com/dWcu97Wt

I would highlight the needed parts(these are the groups to match). Starting with...

ending with...

There are 22 blocks of data (groups) in total.

Looking at the regex provided by @Wiktor Stribiżew in comments: https://regex101.com/r/dmCNuH/1

match 11 is the first real needed data (match group) though truncates the final line.

enter image description here

After the amazing pattern I got it from Wiktor, I tried to get all the matches

Sub Test()
    Dim a(), s As String, i As Long, j As Long
        Dim bot As New ChromeDriver
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        s = .FindElementByCss("table[id='all']").Text

    End With
        a = GetMatches(s, "^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)+")
        For i = LBound(a) To UBound(a)
            Debug.Print a(i)
        Next i
End Sub

Function GetMatches(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim arrMatches(), matches As Object, iMatch As Object, s As String, i As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                arrMatches(i) = iMatch.SubMatches.Item(0)
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

But this doesn't work for me and throws an error.

2
Try ^\s*\d{1,3}(?:(?:\r\n|[\r\n])(?!\s*\d{1,3}\n).*)* with regExp.Multiline = True.Wiktor Stribiżew
Where from do you have the string in the picture? From a text file, or from an Excel cell? I am asking that, due to the fact that splitting on rows can be easily done with standard VBA. And result to be pasted in excel, on separate cells...FaneDuru
There are a lot of leading-trailing spaces. Would they always be there on the first and second line after the start of a block? This, so we can differentiate between the start of a block and the lines with digits below.JvdV
@YasserKhalil You probably want the regex to stop also before a line that looks like a date string, right? Then you need regex101.com/r/dmCNuH/4, ^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n|\d{4}/\d{2}/\d{2}\n).*)+. If you have line breaks inside a cell in Excel, those are probably CRs, so you need \r instead of \n.Wiktor Stribiżew

2 Answers

2
votes

You may use

^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}

See the regex demo. Use with .Global = True and .MultiLine = True options, you do not need to set .IgnoreCase to True.

NOTE: Since \r, carriage return, is used inside Excel cell values to define a line break, you may need to replace all \n chars in the pattern with \r.

The regex matches a line that may be indented or not and contains 1, 2 or 3 digits, and then grabs the next four lines that do not match the initial pattern.

More details

  • ^ - start of a line
  • \s* - 0 or more whitespace characters
  • \d{1,3} - one to three digits
  • (?:\n(?!\s*\d{1,3}\n).*){4} - a non-capturing group matching four ({4}) occurrences of
    • \n - a newline character (\n) that is...
    • (?!\s*\d{1,3}\n) - (negative lookahead) not immediately followed with:
      • \s* - 0 or more whitespaces
      • \d{1,3} - one, two or three digits
      • \n - a newline char
    • .* - any 0 or more characters other than line break characters, as many as possible.

To extract detailed information with groups, you may use

^[^\S\n]*(\d{1,3})\n\s*(\d{6,})[^\S\n]*\n\s*(\d{14})[^\S\n]*\n(.+)\n(.+)

See this regex demo

  • ^ - start of string
  • [^\S\n]* - 0 or more whitespace characters other than a newline char
  • (\d{1,3}) - one to three digits
  • \n - a newline
  • \s* - any 0+ whitespaces
  • (\d{6,}) - Group 2:
  • [^\S\n]*\n\s* - 0 or more whitespace characters other than a newline char, a newline and then any 0 or more whitespaces
  • (\d{14}) - Group 3: fourteen digits
  • [^\S\n]*\n - 0 or more whitespace characters other than a newline char and a newline char
  • (.+) - Group 4: any one or more characters other than line break chars, as many as possible
  • \n - a newline
  • (.+) - Group 5: any one or more characters other than line break chars, as many as possible
1
votes

Thanks a lot for Wiktor and QHarr for helping me a lot with this issue. I appreciate a lot their help. Here is the final code and I welcome any other ideas or modifications to the code

Sub Test()
    Dim x, a(1 To 1000, 1 To 5), bot As New ChromeDriver, col As Object, sInput As String, sPattern As String, i As Long, j As Long, cnt As Long
    sPattern = "^\s*\d{1,3}(?:\n(?!\s*\d{1,3}\n).*){4}"
    With bot
        .AddArgument "--headless"
        .Get "file:///C:\Sample.html"
        sInput = .FindElementByCss("table[id='all']").Text
    End With
    With CreateObject("VBScript.RegExp")
        .Global = True: .MultiLine = True: .IgnoreCase = True
        .Pattern = sPattern
        If .Test(sInput) Then
            Set col = .Execute(sInput)
            For i = 0 To col.Count - 1
                x = Split(col.Item(i), vbLf)
                cnt = cnt + 1
                For j = LBound(x) To UBound(x)
                    a(i + 1, j + 1) = Application.WorksheetFunction.Clean(Trim(x(j)))
                Next j
            Next i
        End If
    End With
    ActiveSheet.Range("A1").Resize(cnt, UBound(a, 2)).Value = a
End Sub