I have VB.NET code that finds and replaces text in a Word document (.docx) using OpenXml. However, I want to replace only the text wrapped in HTML-style tags, and remove the tags themselves once the new text has been inserted.

My code is:

Imports System.IO
Imports System.Text.RegularExpressions
Imports DocumentFormat.OpenXml.Packaging

Public Sub SearchAndReplace(ByVal document As String)
    ' Open the .docx file for editing
    Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(document, True)
        Dim docText As String = Nothing

        ' Read the main document part as raw XML
        Using sr As New StreamReader(wordDoc.MainDocumentPart.GetStream())
            docText = sr.ReadToEnd()
        End Using

        ' Replace the placeholder with the new text
        Dim regexText As New Regex("<ReplaceText>")
        docText = regexText.Replace(docText, "Hi Everyone!")

        ' Write the modified XML back to the main document part
        Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
            sw.Write(docText)
        End Using
    End Using
End Sub
You need to use capturing groups. - Amen Jlili

1 Answer

Here's an example that should help you solve your problem.

Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim Text As String = "Blah<foo>Blah"
        ' Print the original text
        Console.WriteLine(Text)

        ' Capturing groups are the parts of the pattern wrapped in parentheses:
        ' group 1 is the opening "<" and group 2 is the closing ">"
        Dim tagRegex As New Regex("(<)[\w/]+(>)")

        ' Replace whatever sits between group 1 and group 2, keeping the brackets for now
        Text = tagRegex.Replace(Text, "$1foo has been replaced.$2")
        Console.WriteLine(Text)

        ' Remove the opening bracket (InStr is 1-based, Remove is 0-based)
        Dim p As Integer = InStr(Text, "<")
        Text = Text.Remove(p - 1, 1)

        ' Remove the closing bracket
        Dim pp As Integer = InStr(Text, ">")
        Text = Text.Remove(pp - 1, 1)

        ' Print the final text
        Console.WriteLine(Text)
        Console.ReadLine()
    End Sub
End Module

Output:

Blah<foo>Blah
Blah<foo has been replaced.>Blah
Blahfoo has been replaced.Blah

The code above will not work correctly if the text contains more than one tag, because it only strips the first "<" and the first ">" it finds.
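If several tags can appear in the same string, one possible workaround is to let the regex consume the whole tag, brackets included, and supply the replacement through a MatchEvaluator. The following is only a minimal sketch: the ReplaceTags helper, the MultiTagExample module and the tag names in the dictionary are made up for illustration, and it assumes tags look like <name> with no attributes.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MultiTagExample
    ' Hypothetical helper: swaps every <name> placeholder for its value
    ' and drops the angle brackets in the same pass.
    Function ReplaceTags(input As String, values As Dictionary(Of String, String)) As String
        ' Group 1 captures the tag name; the brackets are part of the match,
        ' so they disappear when the match is replaced.
        Dim tagRegex As New Regex("<(\w+)>")
        Return tagRegex.Replace(input,
            Function(m As Match) As String
                Dim replacement As String = Nothing
                If values.TryGetValue(m.Groups(1).Value, replacement) Then
                    Return replacement
                End If
                ' Unknown tag: leave it untouched
                Return m.Value
            End Function)
    End Function

    Sub Main()
        ' Illustrative tag names and values only
        Dim values As New Dictionary(Of String, String) From {
            {"foo", "first"},
            {"bar", "second"}
        }
        ' Prints: BlahfirstBlahsecondBlah
        Console.WriteLine(ReplaceTags("Blah<foo>Blah<bar>Blah", values))
    End Sub
End Module
```

Because every match is replaced in a single call, there is no need to locate and strip the brackets afterwards with InStr and Remove.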

I would advise not to use regex to parse HTML.
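For the Word document itself, a regex-free route is to let the Open XML SDK parse the XML and only touch the text elements. This is a rough sketch rather than a drop-in fix: the ReplacePlaceholder name and its parameters are invented here, it requires the DocumentFormat.OpenXml package, and it assumes the whole placeholder sits inside a single text element (Word sometimes splits text across several runs, in which case this will not find it).

```vbnet
Imports DocumentFormat.OpenXml.Packaging
Imports W = DocumentFormat.OpenXml.Wordprocessing

Module OpenXmlReplaceExample
    ' Hypothetical helper: replaces a placeholder such as "<ReplaceText>"
    ' (tags included) with newText in the main document part.
    Public Sub ReplacePlaceholder(path As String, placeholder As String, newText As String)
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(path, True)
            Dim doc As W.Document = wordDoc.MainDocumentPart.Document

            ' Walk every text element; t.Text holds the decoded text,
            ' so "<" and ">" appear literally here rather than as &lt; and &gt;.
            For Each t As W.Text In doc.Descendants(Of W.Text)()
                If t.Text.Contains(placeholder) Then
                    t.Text = t.Text.Replace(placeholder, newText)
                End If
            Next

            ' Persist the changes to the package
            doc.Save()
        End Using
    End Sub
End Module
```

A call might look like ReplacePlaceholder("C:\temp\sample.docx", "<ReplaceText>", "Hi Everyone!"), where the path is just an example.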