I have an application in C# which reads text from a Word (.docx) file using OpenXML.

In general, there is a set of Paragraphs (p) which contain Run elements (r). I can iterate over the Run nodes with

foreach ( var run in para.Descendants<Run>() )
{
  ...
}

In one specific document there is a text "START" which is split into three parts, "ST", "AR" and "T". Each of them is defined by a Run node, but in two cases, the Run node is contained within a "smartTag" node.

<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
    <w:r w:rsidRPr="00BF444F">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:b/>
            <w:bCs/>
            <w:sz w:val="40"/>
            <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>ST</w:t>
    </w:r>
</w:smartTag>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
    <w:r w:rsidRPr="00BF444F">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:b/>
            <w:bCs/>
            <w:sz w:val="40"/>
            <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>AR</w:t>
    </w:r>
</w:smartTag>
<w:r w:rsidRPr="00BF444F">
    <w:rPr>
        <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
        <w:b/>
        <w:bCs/>
        <w:sz w:val="40"/>
        <w:szCs w:val="40"/>
    </w:rPr>
    <w:t xml:space="preserve">T</w:t>
</w:r>

As far as I can tell, OpenXML does not support the smartTag node. As a result, it just generates OpenXmlUnknownElement nodes.

What makes this difficult is that it generates OpenXmlUnknownElement nodes for all of the descendant nodes of the smartTag. This means that I cannot simply get the first child node and cast it to a Run.

Getting the text (via the InnerText property) is easy, but I also need to get the formatting information.

Is there any reasonably easy way to handle this?

At present, my best idea is to write a preprocessor which removes the smart tag nodes.


Edit

Following up on the comment from Cindy Meister.

I am using OpenXml version 2.7.2. As Cindy has pointed out, there is a class SmartTagRun in OpenXML 2.0. I did not know about that class.

I have found the following information on the page "What's new in the Open XML SDK 2.5 for Office":

Smart tags

Because smart tags were deprecated in Office 2010, the Open XML SDK 2.5 doesn't support smart tag related Open XML elements. The Open XML SDK 2.5 still can process smart tag elements as unknown elements, however the Open XML SDK 2.5 Productivity Tool for Office validates those elements (see the following list) in Office document files as invalid tags.

So it sounds like a possible solution would be to use OpenXML 2.0.

Some thoughts... I would think para.Descendants<Run> should also pick up runs in a smartTag? You're saying the SDK is differentiating between w:r nested in w:smartTag? (I can't test because Word doesn't support creating SmartTags anymore - there was a court case that decided MS was using technology patented by another company so the capability had to be stripped out.) – Cindy Meister
Assuming yes ^^ then shouldn't it be possible to check the document.xml for these elements (not using the SDK) and strip them out before using the SDK? – Cindy Meister
para.Descendants<Run> does not pick up the Runs in the smartTag. The smartTag and all descendant nodes are created as OpenXmlUnknownElement nodes. – Phil Jollans
Stripping out the smartTag elements is a good idea, but I'm not exactly sure how to do it. Would I have to unpack the docx file, edit the document.xml file and repack it before using OpenXML, or is there some support in OpenXML for preprocessing the file? – Phil Jollans
It's been a while since I've actually coded using the SDK, which is why I'm so hesitant... In VBA I know I could do this; I'd think it should be possible using the SDK because (as I recall) it's possible to get the "pure" XML of a Part, or part of a tree? Manipulate it, and then write it back? Certainly the .NET Framework Packaging namespace would allow direct access to document.xml (that's what the SDK is doing in the background). It might be worthwhile to start a new question specifically about reading/editing the XML of an Office ZIP package (IOW no emphasis on the SmartTag part). – Cindy Meister

1 Answer

One solution is to use LINQ to XML (or the System.Xml classes if you like those better) to remove the w:smartTag elements while keeping the runs they contain, as shown in the following code:

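using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Xunit;

public class SmartTagTests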
{
    private const string Xml =
        @"<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
<w:body>
    <w:p>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>ST</w:t>
            </w:r>
        </w:smartTag>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>AR</w:t>
            </w:r>
        </w:smartTag>
        <w:r w:rsidRPr=""00BF444F"">
            <w:rPr>
                <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                <w:b/>
                <w:bCs/>
                <w:sz w:val=""40""/>
                <w:szCs w:val=""40""/>
            </w:rPr>
            <w:t xml:space=""preserve"">T</w:t>
        </w:r>
    </w:p>
</w:body>
</w:document>";

    [Fact]
    public void CanStripSmartTags()
    {
        // Say you have a WordprocessingDocument stored on a stream (e.g., read
        // from a file).
        using Stream stream = CreateTestWordprocessingDocument();

        // Open the WordprocessingDocument and inspect it using the strongly-
        // typed classes. This shows that OpenXmlUnknownElement instances are
        // found and only a single Run instance is recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Single(document.Descendants<Run>());
            Assert.NotEmpty(document.Descendants<OpenXmlUnknownElement>());
        }

        // Now, open that WordprocessingDocument to make edits, using LINQ to XML.
        // Do NOT use the strongly typed classes in this context.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
        {
            // Get the w:document as an XElement and demonstrate that this
            // w:document contains w:smartTag elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            string xml = ReadString(part);
            XElement document = XElement.Parse(xml);

            Assert.NotEmpty(document.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Transform the w:document, stripping all w:smartTag elements and
            // demonstrate that the transformed w:document no longer contains
            // w:smartTag elements.
            var transformedDocument = (XElement) StripSmartTags(document);

            Assert.Empty(transformedDocument.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Write the transformed document back to the part.
            WriteString(part, transformedDocument.ToString(SaveOptions.DisableFormatting));
        }

        // Open the WordprocessingDocument again and inspect it using the 
        // strongly-typed classes. This demonstrates that all Run instances
        // are now recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Equal(3, document.Descendants<Run>().Count());
            Assert.Empty(document.Descendants<OpenXmlUnknownElement>());
        }
    }

    /// <summary>
    /// Recursive, pure functional transform that removes all w:smartTag elements.
    /// </summary>
    /// <param name="node">The <see cref="XNode" /> to be transformed.</param>
    /// <returns>The transformed <see cref="XNode" />.</returns>
    private static object StripSmartTags(XNode node)
    {
        // We only consider elements (not text nodes, for example).
        if (!(node is XElement element))
        {
            return node;
        }

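        // Strip w:smartTag elements by returning only their transformed
        // children, so that nested w:smartTag elements are stripped as well.
        if (element.Name.LocalName == "smartTag")
        {
            return element.Elements().Select(child => StripSmartTags(child));
        }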

        // Perform the identity transform.
        return new XElement(element.Name, element.Attributes(),
            element.Nodes().Select(StripSmartTags));
    }

    private static Stream CreateTestWordprocessingDocument()
    {
        var stream = new MemoryStream();

        using var wordDocument = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document);
        MainDocumentPart part = wordDocument.AddMainDocumentPart();
        WriteString(part, Xml);

        return stream;
    }

    #region Generic Open XML Utilities

    private static string ReadString(OpenXmlPart part)
    {
        using Stream stream = part.GetStream(FileMode.Open, FileAccess.Read);
        using var streamReader = new StreamReader(stream);
        return streamReader.ReadToEnd();
    }

    private static void WriteString(OpenXmlPart part, string text)
    {
        using Stream stream = part.GetStream(FileMode.Create, FileAccess.Write);
        using var streamWriter = new StreamWriter(stream);
        streamWriter.Write(text);
    }

    #endregion
}
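
The transform above builds a new tree. If you prefer to edit the tree in place, something like the following might work as well. It is only a sketch: the class and method names (SmartTagUnwrapper, UnwrapSmartTags) are made up for illustration, it uses nothing but System.Xml.Linq, and it assumes you also want to drop any w:smartTagPr property elements, which would otherwise end up as stray children of the paragraph.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Open XML SDK.
public static class SmartTagUnwrapper
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces every w:smartTag below root with its own child nodes,
    // modifying the tree in place.
    public static void UnwrapSmartTags(XElement root)
    {
        // Assumption: smart tag properties are not needed once the
        // smart tag itself is gone, so remove them first.
        root.Descendants(W + "smartTagPr").Remove();

        // Descendants() yields document order; reversing it means nested
        // smart tags are unwrapped before the smart tags that contain them.
        var smartTags = root.Descendants(W + "smartTag").Reverse().ToList();

        foreach (XElement smartTag in smartTags)
        {
            smartTag.ReplaceWith(smartTag.Nodes());
        }
    }
}
```

With the ReadString and WriteString helpers from the test class, this would be used roughly as XElement document = XElement.Parse(ReadString(part)); SmartTagUnwrapper.UnwrapSmartTags(document); WriteString(part, document.ToString(SaveOptions.DisableFormatting));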

You could also use the PowerTools for Open XML, which provide a markup simplifier that directly supports the removal of w:smartTag elements.
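
A minimal sketch of the PowerTools route, with some assumptions stated up front: it assumes the OpenXmlPowerTools NuGet package is referenced, and that MarkupSimplifier.SimplifyMarkup and the SimplifyMarkupSettings.RemoveSmartTags option behave as I remember them, so please check against the version you install. The class and method names (PowerToolsSmartTagRemoval, RemoveSmartTags) are hypothetical.

```csharp
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

// Hypothetical wrapper around the PowerTools markup simplifier.
public static class PowerToolsSmartTagRemoval
{
    // Opens the .docx at the given path for editing and asks PowerTools
    // to remove smart tag markup; all other settings are left at their defaults.
    public static void RemoveSmartTags(string path)
    {
        var settings = new SimplifyMarkupSettings
        {
            RemoveSmartTags = true
        };

        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(path, true))
        {
            MarkupSimplifier.SimplifyMarkup(wordDocument, settings);
        }
    }
}
```

After this has run, the document can be reopened with the strongly typed classes and para.Descendants<Run>() should see the runs that used to sit inside the smart tags.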