I just encountered some problems while parsing HTML documents with nekohtml + dom4j.
I found out that my XPath expressions had stopped working because a default XML namespace was recently added to the HTML source.
Specification says:
The prefix xmlns is used only to declare namespace bindings and is by definition bound to the namespace name http://www.w3.org/2000/xmlns/. It MUST NOT be declared. Other prefixes MUST NOT be bound to this namespace name, and it MUST NOT be declared as the default namespace. Element names MUST NOT have the prefix xmlns.
But in my HTML docs, this was recently added (I guess) to the html tag: xmlns="http://www.w3.org/1999/xhtml"
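To illustrate the symptom, here is a minimal sketch that uses only the JDK's DOM parser and javax.xml.xpath instead of nekohtml + dom4j, so it runs without extra libraries. The class name, the count helper and the two sample documents are made up for the illustration. It assumes a namespace-aware parse, in which case an unprefixed //td only matches elements that are in no namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DefaultNamespaceDemo {
    // Hypothetical sample markup: identical except for the default namespace.
    static final String PLAIN =
        "<html><body><table><tr><td>cell</td></tr></table></body></html>";
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<body><table><tr><td>cell</td></tr></table></body></html>";

    // Parses the markup with namespace awareness on and counts the XPath matches.
    static int count(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(PLAIN, "//td")); // td is in no namespace
        System.out.println(count(XHTML, "//td")); // td is in the XHTML namespace
    }
}
```

With the default namespace declared, every element below the html tag inherits it, so the same unprefixed expression finds nothing.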
I found 2 solutions:
1) Drop namespace with:
DOMParser parser = new DOMParser();
parser.setFeature("http://xml.org/sax/features/namespaces", false);
parser.parse(url);
This is what the NekoHTML FAQ suggests.
2) Add a prefix to my XPath expressions, bound to the default HTML namespace. (It seems the empty string cannot be bound as a prefix to the namespace I want.)
Map<String,String> XPATH_NAMESPACES = new HashMap<String, String>();
XPATH_NAMESPACES.put("my_prefix", "http://www.w3.org/1999/xhtml");
XPath xpath = document.createXPath(xpathExpr);
xpath.setNamespaceURIs(XPATH_NAMESPACES);
Element element = (Element) xpath.selectSingleNode(document);
And then, instead of using //td for example, I use //my_prefix:td
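For anyone not using dom4j, the same prefix-binding idea can be sketched with the JDK's javax.xml.xpath API. This is not the dom4j code above but a rough equivalent: the class name, the firstText helper and the sample document are assumptions made for the example, and the prefix my_prefix is simply reused from solution 2.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample document with the XHTML default namespace.
    static final String XHTML =
        "<html xmlns=\"" + XHTML_NS + "\">"
        + "<body><table><tr><td>cell</td></tr></table></body></html>";

    // Binds the arbitrary prefix "my_prefix" to the XHTML namespace for XPath only.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "my_prefix".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "my_prefix" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            if (prefix == null) {
                return Collections.<String>emptyIterator();
            }
            return Collections.singletonList(prefix).iterator();
        }
    };

    // Returns the text of the first node matching expr, or null when nothing matches.
    static String firstText(String xml, String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            Node node = (Node) xpath.evaluate(expr, doc, XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(XHTML, "//my_prefix:td"));
        System.out.println(firstText(XHTML, "//td"));
    }
}
```

The prefix used in the expression does not have to appear in the document itself; only the namespace URI it is bound to has to match.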
I'm posting these solutions because some people might find them useful. See also http://www.edankert.com/defaultnamespaces.html#Jaxen_and_Dom4J
But what I would really like to know is:
- Why use a namespace different from the default one?
- Why would someone switch from http://www.w3.org/2000/xmlns/ to http://www.w3.org/1999/xhtml?
- Why do we use W3 namespaces in general? Does the namespace have any impact on the browser?
I guess my question may seem obvious to some of you, but I don't really see what it brings. I've read about the differences between HTML and XHTML. I guess people using the XHTML DTD would rather use this namespace, but what is the real benefit, apart from the fact that it causes some additional pain for crawlers and similar tools?
PS: I've seen that to go from HTML to XHTML you have to add both xmlns and xml:lang. So that was probably not the aim of the website I was parsing, since no xml:lang was added...
Thanks