Can anyone recommend a Java library that lets me run XPath queries over URLs? I've tried JAXP without success.
Thank you.
There are several different approaches to this documented on the Web:
Using HtmlCleaner
Using Jericho
I have tried a few variations on these approaches, e.g. HtmlParser plus the Java DOM parser, and JSoup plus Jaxen, but the combination that worked best was HtmlCleaner plus the Java DOM parser. The next best was Jericho plus Jaxen.
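To make the "plus the Java DOM parser" half of that pipeline concrete, here is a minimal sketch using only the JDK. It assumes the cleaning step (HtmlCleaner or similar) has already turned the page into well-formed markup; the cleaner itself is not shown, and the class and method names (`DomXPathSketch`, `parse`, `select`) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomXPathSketch {

    // Parses markup that is ALREADY well-formed (assumed to be the cleaner's output).
    // Raw HTML from the wild would typically make this strict parser throw.
    public static Document parse(String wellFormedMarkup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
    }

    // Evaluates an XPath 1.0 expression and returns the text of every matching node.
    public static List<String> select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample standing in for a cleaned page.
        String cleaned = "<html><body><div><a href=\"https://github.com\">github.com</a></div>"
                + "<table><tr><td>a</td><td>b</td></tr></table></body></html>";
        Document doc = parse(cleaned);
        System.out.println(select(doc, "//a/@href"));
        System.out.println(select(doc, "//tr/td/text()"));
    }
}
```

The strict parser is likely why plain JAXP fails on real pages: most HTML is not well-formed XML, so something has to repair it before the DOM and XPath steps can run.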
You could use TagSoup together with Saxon. That way you simply swap TagSoup in for whatever XML SAX parser you were using, and the XPath 2.0, XSLT 2.0, or XQuery 1.0 implementation works as usual.
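A rough sketch of the "swap the SAX parser" idea, with two substitutions so that it runs on a bare JDK: the reader here is the JDK's own strict parser rather than TagSoup, and the query engine is the built-in XPath 1.0 implementation rather than Saxon. As far as I know, TagSoup's `org.ccil.cowan.tagsoup.Parser` implements `XMLReader`, so it should be usable wherever `reader` appears below. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxReaderSwapSketch {

    // Builds a DOM from whatever SAX events the given reader produces.
    // The reader is the pluggable part: any XMLReader implementation will do.
    public static Node toDom(XMLReader reader, String markup) throws Exception {
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        DOMResult result = new DOMResult();
        // A Transformer created without a stylesheet performs an identity copy.
        TransformerFactory.newInstance().newTransformer().transform(source, result);
        return result.getNode();
    }

    // Evaluates an XPath 1.0 expression and returns its string value.
    public static String evaluate(Node root, String expression) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expression, root);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader (assumption): the JDK's strict XML parser.
        // With TagSoup on the classpath, a TagSoup parser would be passed here instead.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Node dom = toDom(reader,
                "<html><body><a href=\"https://github.com\">github.com</a></body></html>");
        System.out.println(evaluate(dom, "//a/@href"));
    }
}
```

Everything downstream of `toDom` only sees SAX events, so it does not need to know which reader produced them.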
Use Xsoup. According to the docs, it's faster than HtmlCleaner. Example:
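    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.junit.Assert;
    import org.junit.Test;
    import us.codecraft.xsoup.Xsoup;

    public class XsoupExampleTest {  // class name is illustrative
        @Test
        public void testSelect() {
            String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
                    "<table><tr><td>a</td><td>b</td></tr></table></html>";
            // Jsoup tolerates malformed HTML, so no separate cleaning step is needed.
            Document document = Jsoup.parse(html);
            String result = Xsoup.compile("//a/@href").evaluate(document).get();
            Assert.assertEquals("https://github.com", result);
            List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
            Assert.assertEquals("a", list.get(0));
            Assert.assertEquals("b", list.get(1));
        }
    }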
Link to Xsoup - https://github.com/code4craft/xsoup