2
votes

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on its content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to an object (so I have something to pass on to the code where it names the document).

The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:

Program: Program Name (abr)

and the regex I worked out for this is "Program:[^\n]"

The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:

Sub RegExProgram()

Dim regEx
Dim pattern As String

Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"

(missing code here)

End Sub

Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!

3
While that was a good description of how to create regular expressions, it did not address my largest problem... I do not want to replace text, I want to extract it and save it to an object as a string. Also, I am not using Excel to work within cells or worksheets; I am using Word, so when I use regEx.Test() or regEx.Execute() I don't know what to reference. - schradera
I'm assuming that the "Program Name (abr)" part of your string will be different things depending on the document? - Pat Jones
@Pat_Jones, yes it will be different. I have tested the regex above using sample docs and an online regex tester, and that seems to work. It grabs everything from "Program: ..." through the end of the line. - schradera

3 Answers

4
votes

Try this:

You can find documentation for the RegExp class here.

Dim regEx As Object
Dim matchCollection As Object
Dim extractedString As String

Set regEx = CreateObject("VBScript.RegExp")
With regEx
  .IgnoreCase = True
  .Global = False    ' Only look for 1 match; False is actually the default.
  .Pattern = "Program: ([^\r]+)"  ' Word separates lines with CR (\r)
End With

' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
Set matchCollection = regEx.Execute(ActiveDocument.Content.Text)

' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
' Guard against an empty collection: indexing it raises a run-time error.
If matchCollection.Count > 0 Then
    extractedString = matchCollection(0).SubMatches(0)
End If
  • I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have captured Program:<space>.

    • Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
  • The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
    Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.

  • If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.

  • Match objects have the properties .FirstIndex, .Length, and .Value (the captured string). Since .Value is the default property, it is sufficient to access the object itself without referencing .Value: instead of the more verbose matchCollection(0).Value to access the full captured string, you can use the shortcut matchCollection(0). By contrast, .SubMatches entries are plain strings.
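
The points above can be pulled together into a small helper. This is only a sketch: the names ExtractProgramName and ShowProgramName are hypothetical (they do not appear in the question), and it assumes the "Program:" label is followed by a single space, as in the pattern used in this answer.

```vba
' Hypothetical helper: returns the text after "Program: " on its line,
' or an empty string when sourceText contains no such line.
Function ExtractProgramName(ByVal sourceText As String) As String
    Dim regEx As Object
    Dim matches As Object

    Set regEx = CreateObject("VBScript.RegExp")
    With regEx
        .IgnoreCase = True
        .Global = False                   ' Stop after the first match.
        .Pattern = "Program: ([^\r]+)"    ' Capture up to the next CR.
    End With

    Set matches = regEx.Execute(sourceText)
    If matches.Count > 0 Then
        ' SubMatches(0) is the first capture group, a plain string.
        ExtractProgramName = matches(0).SubMatches(0)
    End If
End Function

Sub ShowProgramName()
    ' Quick test with a literal string instead of a real document.
    Dim sample As String
    sample = "Title" & vbCr & "Program: Program Name (abr)" & vbCr & "Body"
    Debug.Print ExtractProgramName(sample)
End Sub
```

Running ShowProgramName should print Program Name (abr) in the Immediate window. For a real document, pass ActiveDocument.Content.Text instead of the sample string; an empty return value then means no "Program: " line was found.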

2
votes

If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:

Public Sub ReadDocument()

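Dim aLine As Paragraph
Dim aLineText As String
Dim my_str As String

Dim start As Long

For Each aLine In ActiveDocument.Paragraphs

    aLineText = aLine.Range.Text
    start = InStr(aLineText, "Program:")

    If start > 0 Then
        ' Keep the text from "Program:" onward, without the trailing paragraph mark.
        my_str = Replace(Mid(aLineText, start), vbCr, "")
        Exit For    ' Stop at the first matching paragraph.
    End If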

Next aLine

End Sub

This reads through the document paragraph by paragraph and stores your match in the variable "my_str" when it encounters a paragraph that contains "Program:".

2
votes

Lazier version:

a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then 
    extractedString = Trim(Split(a(1), vbCr)(0))
End If

If I remember correctly, paragraphs in Word end with vbCr (\r, not \n).
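
One way to check this in your own document is to look at the character code of the last character of a paragraph. This is a minimal sketch; the name CheckParagraphTerminator is hypothetical, and it assumes the first paragraph of the active document is ordinary body text (paragraphs inside table cells may end differently).

```vba
Sub CheckParagraphTerminator()
    Dim paraText As String

    ' Range.Text of a paragraph includes its terminating character.
    paraText = ActiveDocument.Paragraphs(1).Range.Text

    ' A code of 13 indicates vbCr (\r); 10 would indicate vbLf (\n).
    Debug.Print Asc(Right$(paraText, 1))
End Sub
```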