I'm using Beautiful Soup to extract metadata from tobacco documents like this one: http://legacy.library.ucsf.edu/tid/bxf03e00/xml
soup = BeautifulSoup(input)
meta_data = soup.document.metadata
This correctly identifies every tag except this one:
<area>GEE,ED/OFFICE; N408</area>
Beautiful Soup parses the area tag as two separate pieces:
- An area tag, <area></area>, that is empty.
- A piece of text outside any tag, with the content GEE,ED/OFFICE; N408
Does this bug occur because <area> is an HTML tag?
If so, how do I get Beautiful Soup to correctly identify GEE,ED/OFFICE; N408 as the content of the <area> tag?
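
For comparison, here is a minimal sketch of what I would expect, using Python's standard-library xml.etree.ElementTree instead of Beautiful Soup. The sample string is a hypothetical, cut-down stand-in for the real document (only the document/metadata/area nesting is taken from my code above), and the names sample, root, area and area_text are just for illustration. A plain XML parser has no notion of HTML's special tags, so the text stays inside <area>:

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced sample; the real document has many more fields.
sample = """<document>
  <metadata>
    <area>GEE,ED/OFFICE; N408</area>
  </metadata>
</document>"""

# fromstring returns the root element, here <document>.
root = ET.fromstring(sample)

# Look up <area> by its path relative to the root.
area = root.find("./metadata/area")

# In XML, <area> is an ordinary element, so its text is kept inside it.
area_text = area.text
print(area_text)
```

If Beautiful Soup 4 is in use, passing a parser name as the second argument, e.g. BeautifulSoup(input, "xml"), may give the same result, but that needs lxml installed and I have not confirmed it against this document.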