I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml
I am using Python 3 and have not found an existing library that can parse an SGML file containing unclosed tags. SGML allows end tags to be omitted, so an element can be closed implicitly. When I parse the example file with lxml, xml, or Beautiful Soup, the implicitly closed tags end up being closed at the end of the file instead of at the end of the line.
For example:
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
This ends up being interpreted as:
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>
</ZIP>
</STREET>
</FORM>
</COMPANY>
However, I need it to be interpreted as:
<COMPANY>Awesome Corp</COMPANY>
<FORM> 24-7</FORM>
<ADDRESS>
<STREET>101 PARSNIP LN</STREET>
<ZIP>31337</ZIP>
</ADDRESS>
If there is a non-default parser I can pass to lxml or Beautiful Soup (BS4) that handles this, I have not found it.
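For reference, the transformation I am after can be sketched as a line-based pre-pass. This is only a rough illustration, not a real SGML parser: it assumes that every tag followed by text on the same line is a leaf whose value ends at the end of that line, and that a tag alone on a line is a container with an explicit end tag. The name `close_implicit_tags` and the `<ROOT>` wrapper are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: "<TAG>some text" on one line is a leaf element whose value runs
# to the end of the line. A tag with nothing after it (e.g. "<ADDRESS>") is
# treated as a container and left untouched, as are end tags like "</ADDRESS>".
_LEAF = re.compile(r"^(\s*)<([A-Za-z][\w.-]*)>(.*\S.*)$")


def close_implicit_tags(sgml_text):
    """Append an end tag to every leaf line that does not already have one."""
    out = []
    for line in sgml_text.splitlines():
        match = _LEAF.match(line)
        if match:
            indent, tag, value = match.groups()
            # Skip lines that already carry their own end tag.
            if not value.rstrip().endswith(f"</{tag}>"):
                line = f"{indent}<{tag}>{value}</{tag}>"
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "<COMPANY>Awesome Corp",
        "<FORM> 24-7",
        "<ADDRESS>",
        "<STREET>101 PARSNIP LN",
        "<ZIP>31337",
        "</ADDRESS>",
    ])
    fixed = close_implicit_tags(sample)
    print(fixed)
    # Wrap in a hypothetical single root so an XML parser accepts the fragment.
    root = ET.fromstring(f"<ROOT>{fixed}</ROOT>")
    print(root.find("ADDRESS/ZIP").text)
```

This produces the desired output for the sample above, but it would break on values that span several lines, and values containing characters such as `&` would still need escaping before an XML parser accepts them, so I would prefer a parser that understands omitted end tags directly.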