Extract CDATA tagged values from .kml in R

Question

Using R, I'd like to extract the value(s) of the description element from a .kml file.

Here is the file:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2"
 xmlns:gx="http://www.google.com/kml/ext/2.2"
 xmlns:atom="http://www.w3.org/2005/Atom">
 <Document>
 <open>1</open>
 <visibility>1</visibility>
 <name><![CDATA[2013-07-06 4:18pm]]></name>
 ...
 <Placemark>
 <name><![CDATA[2013-07-06 4:18pm (Start)]]></name>
 <description><![CDATA[]]></description>
 <TimeStamp><when>2013-07-06T20:18:56.000Z</when></TimeStamp>
 <styleUrl>#start</styleUrl>
 <Point>
 <coordinates>-78.353348,45.020615,340.29998779296875</coordinates>
 </Point>
 </Placemark>
 <Placemark id="tour">
 <name><![CDATA[2013-07-06 4:18pm]]></name>
 <description><![CDATA[]]></description>
 ...
 <gx:Track>
 <when>2013-07-06T20:18:56.000Z</when>
 <gx:coord>-78.353348 45.020615 340.29998779296875</gx:coord>
 <when>2013-07-06T20:19:12.000Z</when>
 <gx:coord>-78.353315 45.020644 340.29998779296875</gx:coord>
 <when>2013-07-06T22:12:23.000Z</when>
 <gx:coord>-78.353108 45.020736 342.29998779296875</gx:coord>
 <ExtendedData>
  ...
  <Placemark>
  <name><![CDATA[2013-07-06 4:18pm (End)]]></name>
  <description><![CDATA[Created by Google My Tracks on Android.

  Name: 2013-07-06 4:18pm
  Activity type: cycling
  Description: -
  Total distance: 49.62 km (30.8 mi)
  Total time: 1:53:28
  Moving time: 1:50:17
  Average speed: 26.24 km/h (16.3 mi/h)
  Average moving speed: 27.00 km/h (16.8 mi/h)
 Max speed: 61.20 km/h (38.0 mi/h)
 Average pace: 2.29 min/km (3.7 min/mi)
 Average moving pace: 2.22 min/km (3.6 min/mi)
 Fastest pace: 0.98 min/km (1.6 min/mi)
 Max elevation: 406 m (1333 ft)
 Min elevation: 265 m (868 ft)
 Elevation gain: 690 m (2263 ft)
 Max grade: 12 %
 Min grade: -11 %
 Recorded: 2013-07-06 4:18pm
  ]]></description>
 ...
 </Placemark>
 </Document>
 </kml>

And here is what I want to extract: the text contained in

 <description><![CDATA[Created by Google My Tracks on Android.: ]]></description>

i.e.:

  Name: 2013-07-06 4:18pm
  Activity type: cycling
  Description: -
  Total distance: 49.62 km (30.8 mi)
  Total time: 1:53:28
  Moving time: 1:50:17
  Average speed: 26.24 km/h (16.3 mi/h)
  Average moving speed: 27.00 km/h (16.8 mi/h)
 Max speed: 61.20 km/h (38.0 mi/h)
 Average pace: 2.29 min/km (3.7 min/mi)
 Average moving pace: 2.22 min/km (3.6 min/mi)
 Fastest pace: 0.98 min/km (1.6 min/mi)
 Max elevation: 406 m (1333 ft)
 Min elevation: 265 m (868 ft)
 Elevation gain: 690 m (2263 ft)
 Max grade: 12 %
 Min grade: -11 %
 Recorded: 2013-07-06 4:18pm

xmlToList gives me NULL, I think because the CDATA tag means the content that follows it is not processed by the parser:

library(XML)
xml <- xmlTreeParse("test1.kml", useInternalNodes=TRUE)
xmllist <- xmlToList(xml)
xmllist$Document$Placemark$description
[[1]]
NULL

I think that is what this means: "The term CDATA is used about text data that should not be parsed by the XML parser ... Everything inside a CDATA section is ignored by the parser. A CDATA section starts with "<![CDATA[" ..."

The following does not work for me either, perhaps for the same CDATA-related reason:

z1 <- xpathApply(xml, "//description", xmlValue)
z1
list()

Can anyone help me extract the text in the file?

Here is a link to the file: https://docs.google.com/file/d/0B__iOdFGJbXYOHJGbWJVNW0tS3M/edit?usp=sharing

Jake Burkhead · Accepted Answer · 2013-07-09T05:31:07

library(XML)

# Parse into an internal (C-level) document so xmlValue can read node text
doc <- xmlTreeParse("test1.kml", useInternalNodes = TRUE)
root <- xmlRoot(doc)

xmlValue(root[["Document"]][["name"]])

R> xmlValue(root[["Document"]][["name"]])
 [1] "2013-07-06 4:18pm"

Also, xmlToDataFrame(root) and xmlToDataFrame(doc) return that value in the name column, whereas xmlToList on either root or doc returns NULL for the value of any CDATA section. I'm looking at the name node because your example, copied and pasted, doesn't parse with xmlParse. From my own small tests, it looks like this should work on any CDATA section.
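
The same xmlValue approach can probably be pointed at the description nodes with XPath. The sketch below makes some assumptions: the inline kml_text is a trimmed stand-in modelled on the question's file (not the real test1.kml), and the prefix "k" is an arbitrary name bound to the KML namespace. The empty list() in the question is most likely caused by the document's default namespace rather than by CDATA, since an unprefixed "//description" does not match namespaced elements.

```r
library(XML)

# Trimmed stand-in for the question's file (assumption: not the real test1.kml)
kml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name><![CDATA[2013-07-06 4:18pm]]></name>
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (Start)]]></name>
<description><![CDATA[]]></description>
</Placemark>
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (End)]]></name>
<description><![CDATA[Created by Google My Tracks on Android.

Name: 2013-07-06 4:18pm
Activity type: cycling
Total distance: 49.62 km (30.8 mi)
]]></description>
</Placemark>
</Document>
</kml>'

doc <- xmlParse(kml_text, asText = TRUE)

# Bind the KML default namespace to a prefix ("k" is an arbitrary choice)
ns <- c(k = "http://www.opengis.net/kml/2.2")

# Unprefixed query, as in the question: finds no nodes in a namespaced document
unprefixed <- xpathApply(doc, "//description", xmlValue)

# Prefixed query: xmlValue returns the CDATA text of each description node;
# unlist() flattens the result to a character vector
descriptions <- unlist(xpathApply(doc, "//k:description", xmlValue,
                                  namespaces = ns))

# Keep the non-empty description and split it into trimmed, non-blank lines
stats <- descriptions[nzchar(descriptions)]
stat_lines <- trimws(strsplit(stats[1], "\n")[[1]])
stat_lines <- stat_lines[nzchar(stat_lines)]
print(stat_lines)
```

On the real file, replacing the inline text with xmlParse("test1.kml") should leave the rest unchanged, provided the file itself parses.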
