0
votes

I want to extract information of various tables embedded in a XML file similarly to how SAS XML mapper does

I came across the below documentation of SAS XML mapper and wants to implement similar functionality i.e identify details of multiple tables contained in a single XML file using R *"SAS XML Mapper automatically analyzes the structure of an XML file (or XML Schema) and generates basic metadata. The process consolidates all instances of a specific XPath into a unique candidate field, counts the number of occurrences, captures maximum field length, keeps a representative sample of data instances, and heuristically suggests a data type (string, numeric, date, etc.) based on the sampling. The availability of this information makes identifying table, row, and column components much simpler."

1
have a look at the XML package cran.r-project.org/web/packages/XML/index.htmlRentrop
Agreed, check the examples in ?XML::readHTMLTable (XML package has to be installed). In addition, please follow the request in the R tag description here on stackoverflow, which says: "Please supplement your question with a minimal reproducible example."lukeA
@lukeA I have provided link for the XML dataset. While using SAS XML wrapper I was able to get 5 tables in a single XML. I went ythrough the XML package document but could not find anything which can help in identfying multiple table structure in a single XMLSam

1 Answers

1
votes

I do not know what tables you are looking for or what SAS XML does. However, maybe this is something you can build up on:

library(XML)
download.file("https://www.dropbox.com/s/5hd7g4obztr41f4/20120518_100340289.xml?dl=1", tf <- tempfile(fileext = ".xml"))
doc <- xmlParse(tf)
lst <- xmlToList(doc)
str(lst[1:2], 2) # structure of first two list elements (2 nesting levels)
# List of 2
#  $ CostCentre:List of 10
#   ..$ Name                   : chr "0320 3300000320 Galeomma"
#   ..$ ExternalReferenceNumber: NULL
#   ..$ DateCreated            : chr "2008-06-28T22:31:40"
#   ..$ DateModified           : chr "2008-11-07T11:23:31"
#   ..$ Creator                :List of 3
#   ..$ Modifier               :List of 3
#   ..$ IsActive               : chr "0"
#   ..$ Company                :List of 2
#   ..$ BusinessGroup          :List of 2
#   ..$ .attrs                 :Formal class 'XMLAttributes' [package "XML"] with 1 slot
#  $ CostCentre:List of 10
#   ..$ Name                   : chr "0321 3300000321 Golar Viking"
#   ..$ ExternalReferenceNumber: NULL
#   ..$ DateCreated            : chr "2008-06-28T22:31:40"
#   ..$ DateModified           : chr "2008-11-07T12:50:21"
#   ..$ Creator                :List of 3
#   ..$ Modifier               :List of 3
#   ..$ IsActive               : chr "0"
#   ..$ Company                :List of 2
#   ..$ BusinessGroup          :List of 2
#   ..$ .attrs                 :Formal class 'XMLAttributes' [package "XML"] with 1 slot

do.call(rbind.data.frame, lapply(lst[1:2], "[", c("Name", "DateCreated")))
#                                     Name         DateCreated
# CostCentre      0320 3300000320 Galeomma 2008-06-28T22:31:40
# CostCentre1 0321 3300000321 Golar Viking 2008-06-28T22:31:40