A simple question, but I can't seem to find the answer anywhere... Is there an existing method in Perl (or perhaps a command-line tool) to check whether a given XML file contains mixed content?

I just need something that tells me whether mixed content is present or not, although details about any mixed content found would be a bonus. Ideally it would process the file without loading it completely into memory, as the files I need to analyse are hundreds of MB and in some cases a few GB. If nothing exists then I'll start looking at writing something myself.

All the above assumes that an XSD/Schema file is not available for the given XML file.

1 Answer

The XPath query boolean(//*[text()[normalize-space()] and *]) returns true if any element has both element children and non-whitespace text children.
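To make it concrete what that query tests, here is a rough in-memory sketch in Python (not Perl, sorry). ElementTree's limited XPath subset cannot evaluate this expression, so the sketch reimplements the same test by hand; the function name has_mixed_content is just something I made up for illustration. It loads the whole tree, so it is only suitable for small files.

```python
import xml.etree.ElementTree as ET

# Whitespace characters as defined by XML (space, tab, CR, LF).
XML_WHITESPACE = " \t\r\n"


def has_mixed_content(root):
    """Return True if any element has element children and non-whitespace text."""
    for elem in root.iter():  # root itself plus all descendants
        if len(elem) == 0:
            continue  # no element children, so it cannot be mixed
        # Text directly inside elem: its leading text plus each child's tail.
        chunks = [elem.text] + [child.tail for child in elem]
        if any(chunk and chunk.strip(XML_WHITESPACE) for chunk in chunks):
            return True
    return False
```

Called as has_mixed_content(ET.parse("file.xml").getroot()), where file.xml stands in for your own file.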

For a streamed algorithm you'll need a stack; at every level in the stack you need to track whether you have encountered non-whitespace text children and/or element children at that level. Not too difficult to achieve with a SAX-like API, though I wouldn't know where to start in Perl.
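As a rough illustration of the stack idea (in Python's xml.sax rather than Perl, so treat it as a sketch to port), each open element gets a frame recording whether element children and non-whitespace text have been seen. The class and function names here are my own invention, and the reported path is just the element names joined with slashes, without positions.

```python
import xml.sax

# Whitespace characters as defined by XML (space, tab, CR, LF).
XML_WHITESPACE = " \t\r\n"


class MixedContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []  # one [name, has_element_child, has_text] frame per open element
        self.mixed = []  # paths of elements found to have mixed content

    def startElement(self, name, attrs):
        if self.stack:
            self.stack[-1][1] = True  # the parent has an element child
        self.stack.append([name, False, False])

    def characters(self, content):
        # Text may arrive in several chunks; any non-whitespace chunk counts.
        if self.stack and content.strip(XML_WHITESPACE):
            self.stack[-1][2] = True

    def endElement(self, name):
        path = "/" + "/".join(frame[0] for frame in self.stack)
        _, has_element, has_text = self.stack.pop()
        if has_element and has_text:
            self.mixed.append(path)


def find_mixed_content(source):
    """Stream-parse a file name or file object; return paths with mixed content."""
    handler = MixedContentHandler()
    xml.sax.parse(source, handler)
    return handler.mixed
```

Memory use grows with nesting depth plus the number of hits, not with file size. If you only need a yes/no answer, you could raise an exception from endElement on the first hit to stop parsing early.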

With XSLT 3.0 streaming I think it can be done with xsl:iterate:

<xsl:mode streamable="yes"/>
<xsl:template match="*">
  <!-- Walk the children once, remembering what has been seen so far -->
  <xsl:iterate select="node()">
    <xsl:param name="found-element" select="false()"/>
    <xsl:param name="found-text" select="false()"/>
    <xsl:on-completion>
      <xsl:if test="$found-element and $found-text">
        <out>Found mixed content!!</out>
      </xsl:if>
    </xsl:on-completion>
    <!-- Recurse so that nested elements are checked as well -->
    <xsl:apply-templates select="."/>
    <xsl:next-iteration>
      <xsl:with-param name="found-element" select="$found-element or self::*"/>
      <xsl:with-param name="found-text" select="$found-text or self::text()[normalize-space()]"/>
    </xsl:next-iteration>
  </xsl:iterate>
</xsl:template>

There's plenty of room for improving this; currently it emits one message for every element that has mixed content, so a file with lots of mixed content will produce lots of messages.