I'm scraping values from HTML pages in a Java program, using XPath to reach a specific tag and occasionally regular expressions to clean up the data I receive.
After some research, I landed on HTML Cleaner ( http://htmlcleaner.sourceforge.net/ ) as the most reliable way to parse raw HTML into well-formed XML. HTML Cleaner, however, only supports XPath 1.0, and I find myself needing functions like 'contains'. For instance, in this piece of XML:
<div>
<td id='1234 foo 5678'>Hello</td>
</div>
I would like to be able to get the text 'Hello' with the following XPath:
//div/td[contains(@id, 'foo')]/text()
Is there any way to get this functionality? I have several ideas, but would prefer not to reinvent the wheel if I don't need to:
- If there is a way to call HTML Cleaner's evaluateXPath and return a TagNode (which I have not found), I can use an XML serializer on the returned TagNode and chain together XPaths to achieve the desired functionality.
- I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can't find a good Java XPath evaluator that works on a string.
- Using TagNode functions like getElementsByAttValue, I could essentially recreate XPath evaluation and add the contains functionality using String.contains.
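For the second idea, here is a minimal sketch of evaluating an XPath expression against an XML string using only the JDK's javax.xml.xpath package. It assumes the cleaned output is already well-formed XML held in a String; the HTML Cleaner serialization step is not shown, and the class and method names (XPathContainsSketch, evaluate) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class: evaluates XPath 1.0 expressions on an XML string.
class XPathContainsSketch {

    // Parses the XML string and returns the string value of the expression's result
    // (for a node-set, the string value of its first node; "" if nothing matches).
    static String evaluate(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The XML fragment from the question, assumed to be already cleaned.
        String xml = "<div><td id='1234 foo 5678'>Hello</td></div>";
        System.out.println(evaluate(xml, "//div/td[contains(@id, 'foo')]/text()"));
    }
}
```

Run against the fragment above, this should print Hello. The String-returning overload of evaluate only yields the first matching node's text, so collecting several matches would need the overload that takes XPathConstants.NODESET.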
Short question: Is there any way to use XPath contains on HTML using an existing Java library?
contains is in XPath 1.0: w3.org/TR/xpath/#function-contains – Wayne