1
votes

I am searching for a library, tool, or even some simple code that can parse the XPath expressions in our XSLT files and produce a Dictionary/List/Tree of all the XML nodes that the XSLT expects to work on or find. Sadly, everything I am finding deals with using XSLT to parse XML rather than with parsing XSLT itself. The really difficult part is how flexible XPath is.

For example, across the several XSLT files we work with, an entry may select on

nodeX/nodeY/nodeNeeded;

OR

../nodeNeeded;

OR

select nodeX then select nodeY then select nodeNeeded; and so forth.

What we would like is to parse the XSLT document and get a data structure that explicitly tells us that the XSLT is looking for nodeNeeded under the path nodeX/nodeY, so that we can custom-build the XML data in a minimal fashion.

Thanks!

Here is a mocked-up subset of the data for visualization purposes:

<server_stats>
    <server name="fooServer">
        <uptime>24d52m</uptime>
        <userCount>123456</userCount>
        <loggedInUsers>
            <user name="AnnaBannana">
                <created>01.01.2012:00.00.00</created>
                <loggedIn>25</loggedIn>
                <posts>3</posts>
            </user>
        </loggedInUsers>
        <temperature>82F</temperature>
        <load>72</load>
        <mem_use>45</mem_use>
        <visitors>
            <current>42</current>
            <browsers name="mozilla" version="X.Y.Z">22</browsers>
            <popular_link name="index.html">39</popular_link>
            <history>
                <max_visitors>789</max_visitors>
                <average_visitors>42</average_visitors>
            </history>
        </visitors>
    </server>
</server_stats>

From this, one customer may just want to create an admin HTML page where they pull the hardware stats out of the tree and perhaps run some load calculations from the visitor count. Another customer may want to pull just the visitor count information to display on their public site. To keep each customer's system load as small as possible, we would like to parse their stat-selecting XSLT and provide them with just the data they need (that is, the data it requests). The obvious issue is that one customer may perform a direct select on the visitor count node, while another may select the visitors node and then select each of the child nodes they want, and so on.
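The "provide them with just the data they need" step could look roughly like the sketch below. It assumes the set of needed paths is already known (the path used here is only an illustration), that paths are written from the document element down, and that no namespaces are involved. The function name `prune` and the shortened `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

def prune(elem, needed, prefix=""):
    """Strip everything below elem that is neither on nor under a needed path.

    Returns True if elem itself should be kept by its parent.
    """
    path = prefix + "/" + elem.tag if prefix else elem.tag
    if path in needed:
        return True  # keep this element together with its whole subtree
    kept_any = False
    for child in list(elem):  # iterate over a copy, since children get removed
        if prune(child, needed, path):
            kept_any = True
        else:
            elem.remove(child)
    return kept_any

# Shortened version of the sample data above (an assumption for brevity).
SAMPLE = """<server_stats>
  <server name="fooServer">
    <uptime>24d52m</uptime>
    <visitors>
      <current>42</current>
      <history><max_visitors>789</max_visitors></history>
    </visitors>
  </server>
</server_stats>"""

root = ET.fromstring(SAMPLE)
prune(root, {"server_stats/server/visitors/current"})
print(ET.tostring(root, encoding="unicode"))
```

Ancestors of a needed node keep their attributes, so `server` still carries its `name`. If nothing matches, the document element is left empty rather than removed.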

The two hypothetical customers looking for the "current" node in "visitors" might have XSLT that looks like this:

<xsl:template match="server_stats/server/visitors">
    <xsl:value-of select="current"/>
</xsl:template>

OR

<xsl:template match="server_stats">
     <xsl:for-each select="server">
          <xsl:value-of select="visitors/current"/>
          <xsl:value-of select="visitors/popular_link"/>
     </xsl:for-each>
</xsl:template>

In this example both are trying to select the same node, but the way they do it is different. "current" is not very specific, so we also need the path they used to get there, since several items could have a "current" node. This prevents us from just searching for "current" in their XSLT, and because the way they reach the path can be very different, we can't just search for the whole path either.

So the result we would like is to parse their XSLT and get, say, a List of stats:

Customer 1:
visitors/current
Customer 2:
visitors/current
visitors/popular_link

etc.
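A minimal sketch of how such a list might be built, assuming the stylesheets only use `xsl:template match`, `xsl:for-each` and plain `select` attributes as in the two examples above. It reports full paths from the template's match pattern rather than the shortened `visitors/current` form. It does not follow `xsl:apply-templates` or `xsl:call-template`, treats every expression as an opaque string, and will also record non-path selects such as string literals. The name `collect_paths` is made up for this example.

```python
import xml.etree.ElementTree as ET

XSL = "{http://www.w3.org/1999/XSL/Transform}"

def collect_paths(xslt_text):
    """Return the paths a stylesheet selects, in document order."""
    found = []

    def join(context, expr):
        # Absolute expressions replace the context; relative ones extend it.
        if expr.startswith("/") or not context:
            return expr
        return context.rstrip("/") + "/" + expr

    def walk(node, context):
        for child in node:
            child_context = context
            select = child.get("select")
            if child.tag == XSL + "template" and child.get("match"):
                # The match pattern becomes the context for the template body.
                child_context = child.get("match")
            elif select is not None:
                path = join(context, select)
                found.append(path)
                # Among the instructions handled here, only for-each
                # changes the context node for its children.
                if child.tag == XSL + "for-each":
                    child_context = path
            walk(child, child_context)

    walk(ET.fromstring(xslt_text), "")
    return found

CUSTOMER_2 = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="server_stats">
    <xsl:for-each select="server">
      <xsl:value-of select="visitors/current"/>
      <xsl:value-of select="visitors/popular_link"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

for path in collect_paths(CUSTOMER_2):
    print(path)
# server_stats/server
# server_stats/server/visitors/current
# server_stats/server/visitors/popular_link
```

For the first customer's stylesheet the same function yields `server_stats/server/visitors/current`, so both access styles end up at the same full path.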

Some example selects that break the solution provided below, which we will be working on solving:

<xsl:variable name="fcolor" select="'Black'"/> results in a /'Black' entry, although it is a string literal rather than a node.
<xsl:for-each select="server"> we get the entry itself, but the paths of its children no longer include it.
<xsl:value-of select="../../@name"/> This was more or less expected. We can try to figure out how to skip attribute-based selections, but the relative paths show up as I thought they would.
<xsl:when test="substring(someNode,1,2)=0 and substring(someNode,4,2)=0 and substring(someNode,7,2)>30"> This one is throwing me a bit, because the whole test expression shows up as a path item. It is caused by the xsl:when check in the solution, but I don't see a nice fix, since the same kind of statement could have been testing for a branching path. This may be one of those cases we need to post-process.
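One possible post-processing step for cases like these is sketched below. It is a regex-based heuristic, not an XPath parser: it drops quoted literals, numbers, variables, function names and attribute steps, and keeps whatever still looks like an element path. Names inside predicates are reported on their own rather than relative to the step they belong to, and axis syntax such as `child::x` is ignored. The name `element_paths` is made up for this example.

```python
import re

LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")        # 'Black', "Black"
CHUNK = re.compile(r"[^\s()\[\],=<>!|+]+")          # runs between delimiters
NAME = re.compile(r"(?:[A-Za-z_][\w.-]*:)?[A-Za-z_][\w.-]*$")
OPERATORS = {"and", "or", "div", "mod"}

def element_paths(expr):
    """Heuristically list the element paths mentioned in an XPath 1.0 expression."""
    expr = LITERAL.sub(" ", expr)
    paths = []
    for m in CHUNK.finditer(expr):
        chunk = m.group()
        if chunk in OPERATORS or chunk.startswith("$"):
            continue
        steps = chunk.split("/")
        # A name directly followed by "(" is a function or node test.
        if expr[m.end():].lstrip().startswith("("):
            steps.pop()
        # Drop a trailing attribute step but keep the element steps before it.
        if steps and steps[-1].startswith("@"):
            steps.pop()
        # Require at least one real element name among the steps.
        if not any(NAME.match(s) for s in steps):
            continue
        if all(s in ("", ".", "..", "*") or NAME.match(s) for s in steps):
            path = "/".join(steps)
            if path not in paths:
                paths.append(path)
    return paths

print(element_paths("'Black'"))        # []
print(element_paths("../../@name"))    # []
print(element_paths(
    "substring(someNode,1,2)=0 and substring(someNode,4,2)=0 "
    "and substring(someNode,7,2)>30"))  # ['someNode']
```

With this, the `xsl:when` test above contributes only `someNode`, and the literal and attribute selections contribute nothing.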

2 Answers

0
votes

That's going to be challenging, because XSLT is so context-dependent. You're right to call this "parsing" because you're going to have to duplicate a lot of the logic that would go into a parser.

My suggestion would be to start with a brute-force approach, and refine it as you find more test cases that it can't handle. Look at a couple of XSLT files and write code that can find the structures you're looking for. Look at a few more and if any new structures appear, refine your code to find those, too.

This will not find every possible way that XSLT and XPath can be used, as a rigorous, specification-driven approach to parsing these files would, but it will be a much smaller project and will find the structures that the developers of these files tended to use.
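As a first pass at "look at a couple of XSLT files", a small inventory like the sketch below can show which instructions actually carry XPath in a given set of stylesheets. It only looks at the `match`, `select` and `test` attributes, which is an assumption made for this example and not an exhaustive list (`use` on `xsl:key` and attribute value templates are ignored, for instance). The name `inventory` is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
XPATH_ATTRS = ("match", "select", "test")

def inventory(xslt_texts):
    """Count (instruction, attribute) pairs that carry XPath across stylesheets."""
    counts = Counter()
    for text in xslt_texts:
        for elem in ET.fromstring(text).iter():
            if not elem.tag.startswith("{" + XSL_NS + "}"):
                continue  # literal result elements are skipped
            local = elem.tag.split("}", 1)[1]
            for attr in XPATH_ATTRS:
                if attr in elem.attrib:
                    counts[(local, attr)] += 1
    return counts

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="server_stats">
    <xsl:for-each select="server">
      <xsl:value-of select="visitors/current"/>
      <xsl:value-of select="visitors/popular_link"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

for (instruction, attr), n in inventory([STYLESHEET]).most_common():
    print(instruction, attr, n)
```

Any pair that shows up in the counts but is not yet handled by the path-finding code is the next test case to refine against.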

0
votes

It is unrealistic to try reconstructing the structure of the source XML document from just looking at an XSLT transformation that operates on this document.

Most XSLT transformations operate on a class of XML documents, that is, on more than one specific document type.

For example, the following identity transformation is one of the most used XSLT transformations:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

Nothing can be deduced from this transformation about the structure of the XML document(s) that it processes.

There is a huge variety of transformations that just override the template from the above transformation.

For example, this is a useful transformation that renames any element having a particular name, specified in an external parameter:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:param name="pName"/>
 <xsl:param name="pNewName"/>

 <xsl:template match="node()|@*" name="identity">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

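 <xsl:template match="*">
  <xsl:choose>
   <!-- Elements named $pName are re-created under the new name -->
   <xsl:when test="name() = $pName">
    <xsl:element name="{$pNewName}">
     <xsl:apply-templates select="node()|@*"/>
    </xsl:element>
   </xsl:when>
   <!-- Every other element is copied unchanged -->
   <xsl:otherwise>
    <xsl:call-template name="identity"/>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>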
</xsl:stylesheet>

Once again, absolutely nothing can be said about the names and structure of the source XML document.
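A tool doing this kind of analysis could at least detect such stylesheets and fall back to supplying the full document. The sketch below flags `match` and `select` expressions containing `node()`, `*` or `//`. This is a deliberately crude test chosen for this example: it will also flag a `*` used for multiplication, and it does not notice generic access through functions such as `name()`. The name `generic_patterns` is made up.

```python
import re
import xml.etree.ElementTree as ET

XSL = "{http://www.w3.org/1999/XSL/Transform}"
# node(), any * (including @*), or the descendant shortcut //.
GENERIC = re.compile(r"node\(\)|\*|//")

def generic_patterns(xslt_text):
    """Return (instruction, expression) pairs that may match arbitrary nodes."""
    hits = []
    for elem in ET.fromstring(xslt_text).iter():
        if not elem.tag.startswith(XSL):
            continue
        for attr in ("match", "select"):
            expr = elem.get(attr)
            if expr and GENERIC.search(expr):
                hits.append((elem.tag[len(XSL):], expr))
    return hits

IDENTITY = """<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>"""

print(generic_patterns(IDENTITY))
# [('template', 'node()|@*'), ('apply-templates', 'node()|@*')]
```

An empty result does not prove that the stylesheet touches only the paths it names; it only means none of these obvious wildcards were found.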

UPDATE:

Perhaps something like this:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="xsl:template[@match]">
  <xsl:variable name="vPath" select="string(@match)"/>

  <xsl:value-of select="concat('&#xA;', $vPath)"/>

  <xsl:apply-templates select="*">
   <xsl:with-param name="pPath" select="$vPath"/>
  </xsl:apply-templates>
 </xsl:template>

 <xsl:template match="*">
  <xsl:param name="pPath"/>

  <xsl:apply-templates select="*">
   <xsl:with-param name="pPath" select="$pPath"/>
  </xsl:apply-templates>
 </xsl:template>

 <xsl:template match="xsl:for-each">
  <xsl:param name="pPath"/>

  <xsl:variable name="vPath">
   <xsl:choose>
    <xsl:when test="starts-with(@select, '/')">
      <xsl:value-of select="@select"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="concat($pPath, '/', @select)"/>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:variable>

  <xsl:value-of select="concat('&#xA;', $vPath)"/>

  <xsl:apply-templates select="*">
   <xsl:with-param name="pPath" select="$vPath"/>
  </xsl:apply-templates>
 </xsl:template>

 <xsl:template match="xsl:if | xsl:when">
  <xsl:param name="pPath"/>

  <xsl:variable name="vPath">
   <xsl:choose>
    <xsl:when test="starts-with(@test, '/')">
      <xsl:value-of select="@test"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="concat($pPath, '/', @test)"/>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:variable>

  <xsl:value-of select="concat('&#xA;', $vPath)"/>

  <xsl:apply-templates select="*">
   <xsl:with-param name="pPath" select="$pPath"/>
  </xsl:apply-templates>
 </xsl:template>

 <xsl:template match="*[@select]">
  <xsl:param name="pPath"/>

  <xsl:variable name="vPath">
   <xsl:choose>
    <xsl:when test="starts-with(@select, '/')">
      <xsl:value-of select="@select"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="concat($pPath, '/', @select)"/>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:variable>

  <xsl:value-of select="concat('&#xA;', $vPath)"/>

  <xsl:apply-templates select="*">
   <xsl:with-param name="pPath" select="$pPath"/>
  </xsl:apply-templates>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XSLT stylesheet:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>

    <xsl:template match="server_stats">
        <xsl:for-each select="server">
            <xsl:value-of select="visitors/current"/>
            <xsl:value-of select="visitors/popular_link"/>

            <xsl:for-each select="site">
              <xsl:value-of select="defaultPage/Url"/>
            </xsl:for-each>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

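the following result is produced:

/
server_stats
server_stats/server
server_stats/visitors/current
server_stats/visitors/popular_link
server_stats/site
server_stats/defaultPage/Url

Note that the paths selected inside xsl:for-each are not prefixed with the for-each path (server_stats/visitors/current rather than server_stats/server/visitors/current). The *[@select] template has a higher default priority (0.5) than the xsl:for-each template (0), so the xsl:for-each template is never chosen. Adding priority="1" to the xsl:for-each template makes the nested paths come out as server_stats/server/visitors/current, server_stats/server/site/defaultPage/Url, and so on.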

Do note: not only is such analysis incomplete, it must also be taken with a grain of salt. These are the results of static analysis. It may happen in practice that, out of 100 paths, only 5-6 are accessed 99% of the time. Static analysis cannot give you such information. Dynamic analysis tools (similar to profilers) can return much more precise and useful information.