Parsing Google Earth KML file in Python (lxml, namespaces)

Question

I am trying to parse a .kml file into Python using the xml module (after failing to make this work in BeautifulSoup, which I use for HTML).

As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:

from lxml import etree
tree = etree.parse('kmlfile')

Here is the example from the tutorial I am trying to emulate:

If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
    print(element.tag, '-', element.text)

I would like to get all data under 'Placemark', so I tried

for i in tree.getiterator("Placemark"):
    print(i, type(i))

which doesn't give me anything. What does work is:

for i in tree.getiterator("{http://www.opengis.net/kml/2.2}Placemark"):
    print(i, type(i))

I don't understand how this comes about. The www.opengis.net URL is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...), but I don't understand:

- how the part in {} relates to my specific example at all
- why it is different from the tutorial
- what I am doing wrong

Any help is much appreciated!

You should take the time to read up on XML namespaces in general (there's a very nice and comprehensive write-up on the MSDN) and how XML namespaces are represented by Element Tree, which lxml emulates (also see "Clark’s notation", for historic context). Further reading, for a way out of this mess: The lxml documentation on doing XPath with namespaces. — Tomalak
Exactly. The "default namespace" is nothing but a convenience facility in XML to cut down on the number of times you have to write a certain prefix. Namespace prefixes themselves are nothing but a convenience facility to cut down on the number of times you have to write a namespace URI. — Tomalak
An XML node is attached to its namespace URI. Node name and Namespace URI are an inseparable union. If you want to select a certain node, you have to know its URI. You can do it explicitly, like etree does it in its proprietary {http://www.opengis.net/kml/2.2}Placemark notation, or implicitly, by assigning a handle ("prefix") to a URI and then using that handle, like XPath does it. You are free to choose whatever handle you like, you don't have to use the same handle that was in the XML. Go ahead and register kml as http://www.opengis.net/kml/2.2 and use kml in your XPath queries. — Tomalak
Imagine namespaces as colors. You could assign the handle red to the color code #FF0000. XML even has a facility to define a default color for all elements that don't have their own color defined. When querying the XML through XPath you must specify the color. XPath knows nothing about the "default color" mechanism that XML provides, you must either query explicitly car[namespace-uri() = '#FF0000'] or tell XPath up-front that #FF0000 shall be known as red so that you can query red:car. — Tomalak

patrick · Accepted Answer · 2016-07-06T22:05:55

Here is my solution. The most important thing to do first is to read the write-up on XML namespaces that Tomalak linked in the comments. It is a really good description of namespaces and easy to understand.

We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.

### Problem

Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:

<Placemark> <name>CITY NAME</name>

The XPath equivalent to the non-functional code I posted above is:

tree = etree.parse('kml document')
result = tree.xpath('//Placemark/name/text()')

The text() part is needed to get the text contained in the nodes matched by //Placemark/name.

As Tomalak pointed out, this doesn't work because the names of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not appear on the individual tags in the document (which confused me), but it is declared once at the beginning of the XML document like this:

xmlns="http://www.opengis.net/kml/2.2"
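To see how the default namespace ends up in the tag names, here is a minimal sketch. The KML string is a made-up stand-in for a real file (an assumption, not the document from the question); the lxml calls used are etree.fromstring, the nsmap and tag attributes, and iter.

```python
from lxml import etree

# Hypothetical minimal KML document, invented for illustration only.
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY NAME</name></Placemark>
  </Document>
</kml>"""

root = etree.fromstring(KML)

# lxml stores the default namespace under the key None in nsmap.
default_ns = root.nsmap[None]

# root[0] is <Document>, root[0][0] is <Placemark>.
# The tag carries the namespace URI in curly brackets.
tag = root[0][0].tag

# The bare name matches nothing; the namespace-qualified name matches.
plain = list(root.iter("Placemark"))
qualified = list(root.iter("{%s}Placemark" % default_ns))

print(default_ns)
print(tag)
print(len(plain), len(qualified))
```

Printing the tag of any element is a quick way to check which namespace, if any, a document puts its elements in.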

### Solution

We can supply namespaces to xpath by setting the namespaces argument:

xpath(X, namespaces={prefix: namespace})

This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
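As a rough sketch of the prefixed case, assuming a made-up KML snippet that declares the gx prefix (the snippet and the GX name are illustrative, not taken from the original file):

```python
from lxml import etree

# Hypothetical KML snippet with the gx extension namespace declared.
KML = b"""<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:gx="http://www.google.com/kml/ext/2.2">
  <Placemark>
    <Point>
      <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode>
    </Point>
  </Placemark>
</kml>"""

root = etree.fromstring(KML)

# Map the prefix used in the XPath expression to the namespace URI.
GX = {"gx": "http://www.google.com/kml/ext/2.2"}

modes = root.xpath("//gx:altitudeMode/text()", namespaces=GX)
print(modes)
```

The prefix in the dictionary only has to match the prefix in the XPath expression; it does not have to be the one used in the document.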

However, XPath does not understand what a default namespace is (cf. the docs). Therefore, we need to trick it, as Tomalak suggested above: we invent a prefix for the default namespace and add it to our search terms. We can call it kml, for instance. This piece of code does the trick:

tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
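Put together as a self-contained sketch: the KML string and the city names below are invented for illustration, and NS is just a name chosen here for the prefix dictionary.

```python
from lxml import etree

# Hypothetical KML content; a real file would be loaded with etree.parse().
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>Berlin</name></Placemark>
    <Placemark><name>Paris</name></Placemark>
  </Document>
</kml>"""

root = etree.fromstring(KML)

# "kml" is an arbitrary prefix we pick for the document's default namespace.
NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Without a prefix, XPath looks for elements in no namespace and finds none.
unprefixed = root.xpath("//Placemark/name/text()")

# With the invented prefix, the same path matches.
names = root.xpath("//kml:Placemark/kml:name/text()", namespaces=NS)

# The same mapping also works for per-element lookups.
for placemark in root.xpath("//kml:Placemark", namespaces=NS):
    print(placemark.findtext("kml:name", namespaces=NS))
```

Keeping the mapping in one dictionary means the URI is written only once, however many queries follow.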

The tutorial mentions that there is also an ETXPath class that works just like XPath, except that the namespaces are written out in curly brackets inside the expression instead of being defined in a dictionary. The expression would then use names of the form {http://www.opengis.net/kml/2.2}Placemark.
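A minimal sketch of the ETXPath variant, again on a made-up KML string (KML_NS and find_names are names chosen for this example):

```python
from lxml import etree

# Hypothetical KML content, invented for illustration only.
KML = b"""<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark><name>CITY NAME</name></Placemark>
</kml>"""

root = etree.fromstring(KML)

KML_NS = "http://www.opengis.net/kml/2.2"

# ETXPath compiles an expression that uses {uri}name notation,
# so no prefix dictionary is needed.
find_names = etree.ETXPath(
    "//{%s}Placemark/{%s}name/text()" % (KML_NS, KML_NS)
)

# The compiled expression is called with the element to search from.
names = find_names(root)
print(names)
```

This trades the prefix dictionary for longer expressions, which may be less readable when a path has many steps.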
