0
votes

I want to find the XPath to a tag that is inside a CDATA section. The XML fragment is below.

<books>
 <book>
  <title></title>
  <content><![CDATA[<p>Hi hello Hw r u?</p><p>We are fine</p><p>Hi babeeee!!!!</p>]]>    </content>
 </book>
</books>

I want to get the text inside the first <p> tag within <content>. Can anybody please give me the correct XPath for it?

2
Pretty sure you can't do that. CDATA is simply character data and does not represent any further document elements. - Phil

2 Answers

4
votes

A CDATA section contains arbitrary character data. In contrast to PCDATA (parsed character data), it is not parsed, so there is no XPath to "elements" inside it.
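
To see this concretely, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree` (my choice for illustration, not something the question mentions). It loads the fragment from the question and checks what the parser actually produced under <content>. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, unchanged.
XML = """<books>
 <book>
  <title></title>
  <content><![CDATA[<p>Hi hello Hw r u?</p><p>We are fine</p><p>Hi babeeee!!!!</p>]]>    </content>
 </book>
</books>"""

root = ET.fromstring(XML)
content = root.find("./book/content")

# The CDATA section produced no child elements, so a path into <p> matches nothing.
p_elements = content.findall("p")

# The markup-looking characters are simply the text of <content>.
content_text = content.text

print(len(p_elements))
print(content_text)
```

The path step into `p` returns an empty list, while the text of <content> still contains the literal `<p>` characters.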

3
votes

As Leif said, the content of the CDATA section is not parsed, so it's just text, even though it looks like markup. You'd have to parse it yourself, which you could do with Saxon (9.1 or later, commercial editions) and saxon:parse. You'd then find that it's not well formed (it has several top-level <p> elements and no single root), so you'd probably have to resort to a parser such as TagSoup.
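
The same "parse it a second time" idea can be sketched outside Saxon. The example below uses Python's standard-library `xml.etree.ElementTree` instead of saxon:parse or TagSoup. It assumes the embedded markup becomes well formed once it is wrapped in a single root element, which holds for the fragment in the question but not for arbitrary HTML. The `first_paragraph` function and the `wrapper` element name are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, unchanged.
XML = """<books>
 <book>
  <title></title>
  <content><![CDATA[<p>Hi hello Hw r u?</p><p>We are fine</p><p>Hi babeeee!!!!</p>]]>    </content>
 </book>
</books>"""


def first_paragraph(xml_text):
    """Return the text of the first <p> stored as character data in <content>."""
    outer = ET.fromstring(xml_text)

    # Step 1: read the CDATA content as a plain string.
    raw = outer.findtext("./book/content")
    if raw is None:
        return None

    # Step 2: the string has several top-level <p> elements, so give it
    # a synthetic root (hypothetical name) before parsing it again.
    inner = ET.fromstring("<wrapper>" + raw + "</wrapper>")

    # Step 3: now <p> is a real element and a path expression can reach it.
    first_p = inner.find("p")
    return None if first_p is None else first_p.text


result = first_paragraph(XML)
print(result)
```

If the embedded content is real-world HTML (unclosed tags, undeclared entities such as `&nbsp;`), the second parse will raise an error, and a tolerant parser is the safer choice.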

You could also treat it as a string:

<xsl:stylesheet version="1.0"
  xmlns:saxon="http://saxon.sf.net/"
  exclude-result-prefixes="saxon"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <Root>
      <!--xsl:value-of select="saxon:parse(/books/book/content)"/-->
      <xsl:for-each select="books/book/content">
        <xsl:value-of select="
          substring-before(
          substring-after( . , '&gt;' ), '&lt;' ) "/>
      </xsl:for-each>
    </Root>
  </xsl:template>
</xsl:stylesheet>