2
votes

I am trying to find the minimum value of a certain element in an XML document (it is actually an HTML table that has been translated to XML). However, this does not work as intended.

The query is similar to the one used in "How can I use XPath to find the minimum value of an attribute in a set of elements?". It looks like this:

/table[@id="search-result-0"]/tbody/tr[
    not(substring-before(td[1], " ") > substring-before(../tr/td[1], " "))
]

Executed on the example XML

<table class="tablesorter" id="search-result-0">
    <thead>
        <tr>
            <th class="header headerSortDown">Preis</th>
            <th class="header headerSortDown">Zustand</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">20 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">25 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">35 CHF</td>
            <td width="175px">Ausgepack und doch nie gebraucht</td>
        </tr>
        <tr>
            <td width="45px">14 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
        <tr>
            <td width="45px">15 CHF</td>
            <td width="175px">Gebraucht, aber noch in Ordnung</td>
        </tr>
    </tbody>
</table>

the query returns the following result:

<tr>
<td width="45px">15 CHF</td>
<td width="175px">Ausgepack und doch nie gebraucht</td>
</tr>
-----------------------
<tr>
<td width="45px">14 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>
-----------------------
<tr>
<td width="45px">15 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>
-----------------------
<tr>
<td width="45px">15 CHF</td>
<td width="175px">Gebraucht, aber noch in Ordnung</td>
</tr>

Why is more than one node returned? Exactly one node should be returned, as there is only a single minimum. Does anybody see what is wrong with the query? It should only return the node containing 14 CHF.

Results obtained using http://xpath.online-toolz.com/tools/xpath-editor.php


3 Answers

3
votes

TML has already pointed out why your current path expression does not work, but has not suggested a working alternative.

The reason is simple, as @Tomalak has said:

"I agree with Mathias. This actually is impossible in XPath 1.0 without changing the input XML."

I am adding this answer to elaborate on how you would have to preprocess your XML before searching for the minimum amount in CHF. And remember: this is so complicated because you asked for a solution in XPath 1.0. With XPath 2.0, your problem could be solved with a single path expression.


XML Design

I think that your question illustrates why XML design is actually essential when working with XML. Why? Because your problem boils down to the following: Your XML is designed in a way that makes it difficult to manipulate the content. More precisely, in a td element like this:

<td width="45px">15 CHF</td>

There is an amount (as a number) and a currency, both in the text node of the td element. If your XML input were designed more cleanly, it would look like this:

<td width="45px" currency="CHF">15</td>

See the difference? Now, different kinds of content are clearly separated from each other.
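
If you cannot change how the table is generated, the split can be done in a preprocessing step. Here is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree. The helper name split_price_cells and the shortened two-row sample are assumptions made for illustration, not part of your setup:

```python
import xml.etree.ElementTree as ET


def split_price_cells(tbody):
    """Move the currency of each row's first <td> into a 'currency' attribute."""
    for td in tbody.findall("tr/td[1]"):
        # "15 CHF" -> amount "15", currency "CHF"
        amount, _, currency = (td.text or "").strip().partition(" ")
        td.text = amount
        if currency:
            td.set("currency", currency)


# Shortened sample based on the table in the question (assumption: two rows suffice)
tbody = ET.fromstring(
    "<tbody>"
    '<tr><td width="45px">15 CHF</td><td width="175px">Ausgepack und doch nie gebraucht</td></tr>'
    '<tr><td width="45px">14 CHF</td><td width="175px">Gebraucht, aber noch in Ordnung</td></tr>'
    "</tbody>"
)
split_price_cells(tbody)
first_cell = tbody.find("tr/td[1]")
print(ET.tostring(first_cell, encoding="unicode"))
```

After this step the first cell of every row contains only the number, which is what the revised expression relies on.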


XPath Revised

Assuming that in the newly designed XML, the only content of a tr/td[1] element is the number, the XPath expression by Pavel Minaev that you used can be made to work:

/table[@id="search-result-0"]/tbody/tr[not(td[1] > ../tr/td[1])][1]

XML Result (tested with the tool you use)

<tr>
<td width="45px">14</td>
<td width="175px">Ausgepack und doch nie gebraucht</td>
</tr>

Why does Pavel's expression not work, simply because I add substring-before?

You found part of the answer yourself already. It has to do with how node-sets are handled when they are passed to XPath 1.0 functions.

substring-before() is an XPath 1.0 function that expects two arguments, both of them strings. Most importantly, if you pass a node-set as the first argument of substring-before(), it is converted to a string by taking the string value of its first node in document order; all other nodes are ignored.

Pavel's answer, adapted to this question:

tr[not(td[1] > ../tr/td[1])][1]

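relies on the fact that the second part of the expression, ../tr/td[1], finds the first td child of every tr element in tbody. There is no function involved, and a node-set is a perfectly valid operand of >: the comparison is true if it holds for at least one node in the set, so not(...) only succeeds for a row whose value is not greater than any other value.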
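
To make the comparison rule concrete, here is a rough Python sketch of how XPath 1.0 evaluates > against a node-set: the comparison is true as soon as it holds for one node. The list prices and the helper greater_than_any are hypothetical names; the values are the amounts from the example table with the currency already removed:

```python
# Amounts from the example table, in document order (currency already stripped)
prices = [15, 20, 25, 35, 14, 15, 15]


def greater_than_any(value, node_set):
    # XPath 1.0: "value > node-set" is true if it is true for at least one node
    return any(value > other for other in node_set)


# tr[not(td[1] > ../tr/td[1])] keeps the rows that are not greater than any other row
minimum_rows = [
    position
    for position, price in enumerate(prices, start=1)
    if not greater_than_any(price, prices)
]

# The trailing [1] picks the first of them, in case the minimum occurs more than once
first_minimum = minimum_rows[0]
print(minimum_rows, first_minimum)
```

Only the fifth row (14) survives the predicate, which matches the result of the revised expression.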

If we need substring-before() because the text content is actually both a number (that we want) and a currency (that we'd like to ignore), we have to wrap it around both parts of the expression:

tr[not(substring-before(td[1],' ') > substring-before(../tr/td[1],' '))][1]

There is no problem on the left side of >, because there is only one td[1] for the current tr. But on the right there is a node-set, namely ../tr/td[1]. Sadly, substring-before() only processes the first node of that set.

See the answer by @TML for the consequences of that.

1
votes

The XPath query you are using would only find the minimum if there were no duplicate values and the values were sorted before being written into the nodes. That is because it only compares the current value, substring-before(td[1], " "), to the first value found by substring-before(../tr/td[1], " "). To break down the comparisons:

[1] not(15 > 15)
[2] not(20 > 15)
[3] not(25 > 15)
[4] not(35 > 15)
[5] not(14 > 15)
[6] not(15 > 15)
[7] not(15 > 15)

Comparisons 1, 5, 6, and 7 evaluate to true (the left-hand side is NOT greater than the right-hand side).
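
The same breakdown can be reproduced with a small Python sketch that imitates the XPath 1.0 behaviour. The functions substring_before and number are simplified stand-ins written for this illustration, not a real XPath engine:

```python
# Text of the first <td> of every row, in document order
cells = ["15 CHF", "20 CHF", "25 CHF", "35 CHF", "14 CHF", "15 CHF", "15 CHF"]


def substring_before(node_set, separator):
    # XPath 1.0 turns a node-set argument into the string value of its FIRST node
    first = node_set[0] if node_set else ""
    head, found, _ = first.partition(separator)
    return head if found else ""


def number(text):
    # Like XPath number(): anything that is not numeric becomes NaN
    try:
        return float(text)
    except ValueError:
        return float("nan")


# Right-hand side: the whole column is passed in, but only "15 CHF" is looked at
right = number(substring_before(cells, " "))

selected = [
    position
    for position, cell in enumerate(cells, start=1)
    if not (number(substring_before([cell], " ")) > right)
]
print(selected)
```

Every row is compared against 15, the value of the first row, so rows 1, 5, 6 and 7 pass the predicate.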

0
votes

In the meantime I decided to use XSLT instead. This is the style sheet that I came up with:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://www.w3.org/1999/xhtml">

    <xsl:output method="text" omit-xml-declaration="yes" indent="no" encoding="UTF-8"/>
    <xsl:strip-space elements="*"/> 

    <xsl:template match="//table[@id='search-result-0']/tbody">
        <ul>
            <xsl:for-each select="tr/td[@width='45px']">
                <xsl:sort select="substring-before(., ' ')" data-type="number" order="ascending"/>

                <xsl:if test="position() = 1">
                     <xsl:value-of select="substring-before(., ' ')"/>
                </xsl:if>
            </xsl:for-each>
        </ul>
    </xsl:template>

    <xsl:template match="text()"/> <!-- ignore the plain text -->

</xsl:stylesheet>
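
For comparison, the same idea (look at the number in front of the space and pick the smallest) can be sketched outside XSLT. This is a rough Python version using only xml.etree.ElementTree; the shortened three-row table and the helper name amount are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the table in the question
table = ET.fromstring(
    '<table id="search-result-0"><tbody>'
    "<tr><td>15 CHF</td><td>Ausgepack und doch nie gebraucht</td></tr>"
    "<tr><td>14 CHF</td><td>Gebraucht, aber noch in Ordnung</td></tr>"
    "<tr><td>20 CHF</td><td>Ausgepack und doch nie gebraucht</td></tr>"
    "</tbody></table>"
)


def amount(td):
    # "14 CHF" -> 14.0; assumes every price cell has the form "<number> <currency>"
    return float(td.text.split(" ", 1)[0])


# First <td> of every row, then the cell with the smallest numeric amount
price_cells = table.findall("tbody/tr/td[1]")
cheapest = min(price_cells, key=amount)
print(cheapest.text)
```

Like the position() = 1 test in the style sheet, min() returns only the first cell if the smallest amount occurs several times.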