1
votes

So, my problem is that I'm trying to do something a little unorthodox. I have a complicated set of XSD files, but I don't want to use them to validate an XML file; I want to parse the XSDs themselves as XML and interrogate them just as I would any other XML file. This is possible because XSDs are valid XML. I am using lxml with Python 3.

The problem I'm having is with the statement:

<xs:include schemaLocation="sdm-extension.xsd"/>

If I instruct lxml to build a schema object for validation like this:

schema = etree.XMLSchema(schema_root)

this dependency will be resolved (the file exists in the same directory as the one I've just loaded). However, because I am treating these files as plain XML, lxml (correctly) treats xs:include as an ordinary element with an attribute and does not follow it.
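
To illustrate what I mean, here is a minimal sketch. The inline schema and its `example` element are made up for the demonstration, and the fallback import is only there so the snippet also runs where lxml is not installed. Parsed as plain XML, the xs:include shows up as just another child of the root, and nothing from sdm-extension.xsd is loaded:

```python
try:
    from lxml import etree  # the library used in this question
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

XS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical schema text; only the xs:include line comes from my real files.
SCHEMA_TEXT = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="sdm-extension.xsd"/>
  <xs:element name="example" type="xs:string"/>
</xs:schema>
"""

root = etree.fromstring(SCHEMA_TEXT)

# The include is an ordinary element: we can read its attribute,
# but the file it points at has not been parsed or merged.
include_locations = [inc.get("schemaLocation") for inc in root.findall("xs:include", XS)]
element_names = [el.get("name") for el in root.findall("xs:element", XS)]

print(include_locations)  # ['sdm-extension.xsd']
print(element_names)      # ['example']
```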

Is there an easy or correct way to extend lxml so that I get the same or similar behaviour as with XInclude, for example:

<xi:include href="metadata.xml" parse="xml" xpointer="title"/>

I could, of course, create a separate xml file manually that includes all the dependencies in the XSD schema. That is perhaps a solution?


2 Answers

1
votes

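Try this. Because the schema is parsed from its file path, lxml knows where it lives and can resolve the xs:include relative to that file when the XMLSchema is built:

from lxml import etree

def validate_xml(schema_file, xml_file):
    # Parse the XSD as ordinary XML, then compile it into a validator
    xsd_doc = etree.parse(schema_file)
    xsd = etree.XMLSchema(xsd_doc)
    xml = etree.parse(xml_file)
    # True if xml_file conforms to the schema, False otherwise
    return xsd.validate(xml)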
0
votes

So it seems that one option is to use XInclude (xi:include) and create a separate XML file that pulls in all the XSDs I want to parse. Something along the lines of:

<fullxsd xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm1-0-0.xsd" parse="xml"/>
<xi:include href="./xsd-cdisc-sdm-1.0.0/sdm-ns-structure.xsd" parse="xml"/>
</fullxsd>

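Then use some lxml along the lines of:

from lxml import etree

def combine(xsd_file):
    # recover=True tolerates malformed input; the other options keep the merged tree compact
    parser = etree.XMLParser(recover=True, encoding='utf-8',
                             remove_comments=True, remove_blank_text=True)

    # Parse from the path rather than from a string so that relative
    # href values are resolved against the wrapper file's location
    tree = etree.parse(xsd_file, parser)

    # Replace every xi:include element with the content it points to
    tree.xinclude()

    print(etree.tostring(tree, pretty_print=True).decode())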

It's not ideal, but it seems to be the proper way. I've looked at custom URI resolvers in lxml, but that would mean actually altering the XSDs, which seems messier.
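
Another possibility that avoids the wrapper file is to follow the xs:include elements by hand. The following is only a rough sketch under some assumptions: `load_with_includes` is a name I made up, every schemaLocation is taken to be a local path relative to the including file (no URLs), and xs:import and xs:redefine are ignored. The fallback import is just so it runs without lxml.

```python
import os

try:
    from lxml import etree  # the library used in this thread
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

XS_INCLUDE = "{http://www.w3.org/2001/XMLSchema}include"


def load_with_includes(path, seen=None):
    """Return {absolute path: root element} for a schema and every
    file it pulls in, directly or indirectly, via xs:include."""
    seen = {} if seen is None else seen
    path = os.path.abspath(path)
    if path in seen:  # already loaded; also guards against circular includes
        return seen

    root = etree.parse(path).getroot()
    seen[path] = root

    # xs:include may only appear as a direct child of xs:schema,
    # so looking at the root's children is enough.
    for child in root:
        if child.tag == XS_INCLUDE:
            location = child.get("schemaLocation")
            if location:
                # Assumption: schemaLocation is a relative file path
                target = os.path.join(os.path.dirname(path), location)
                load_with_includes(target, seen)
    return seen
```

Calling `load_with_includes("sdm1-0-0.xsd")` would then give one parsed root per file, and each can be interrogated like any other XML tree. The trees are kept separate rather than merged, so the original XSDs are left untouched.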