I'm using Beautiful Soup to extract metadata from tobacco documents like this one: http://legacy.library.ucsf.edu/tid/bxf03e00/xml

soup = BeautifulSoup(input)
meta_data = soup.document.metadata

This correctly identifies all tags except for

<area>GEE,ED/OFFICE; N408</area>

Beautiful Soup splits the area tag into two separate pieces:

  • An area tag <area></area> that is empty.
  • A separate piece of text, outside the tag, with the content GEE,ED/OFFICE; N408

Does this happen because <area> is also an HTML tag? And how do I get Beautiful Soup to correctly identify GEE,ED/OFFICE; N408 as the content of the <area> tag?

1 Answer

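The central issue is that you haven't told bs4 that it's parsing XML, so it assumes HTML. Print the soup and notice how the parser wraps everything in <html><body> tags (lxml's HTML parser does this, for example). And yes, it matters that <area> is an HTML tag: in HTML, <area> is a void element that cannot have content, so the parser closes it immediately and the text ends up outside it.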

import requests
from bs4 import BeautifulSoup

req = requests.get('http://legacy.library.ucsf.edu/tid/bxf03e00/xml')

doc = req.text

BeautifulSoup(doc).find('area')
Out[79]: <area></area>
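
You can reproduce this without the network. The snippet below is only a minimal sketch: `sample` is a made-up string that mirrors the document/metadata/area nesting from the question (an assumption, not the real file), and it uses Python's built-in `html.parser` backend so that only bs4 itself is required.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample; only mirrors the nesting described in the question
sample = "<document><metadata><area>GEE,ED/OFFICE; N408</area></metadata></document>"

# Parse as HTML using the standard library backend
soup = BeautifulSoup(sample, "html.parser")
area = soup.find("area")

# <area> is a void element in HTML, so it is closed with no children
area_children = list(area.children)

# The text lands next to the tag, as its sibling, rather than inside it
stray_text = area.next_sibling

print(area_children)
print(repr(stray_text))
```

The empty child list and the stray sibling text are the two pieces described in the question.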

Tell it to parse the document as XML instead. bs4 uses lxml for this, so that dependency must be installed or the call will fail:

BeautifulSoup(doc, 'xml').find('area')
Out[80]: <area>GEE,ED/OFFICE; N408</area>
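
If installing lxml isn't an option, the standard library's `xml.etree.ElementTree` should also be able to read the field. This is a different parser from the one above, shown only as a rough sketch: `sample` is a made-up string, and the `metadata/area` path assumes the nesting implied by `soup.document.metadata` in the question, so adjust it to the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample; the real document has many more fields
sample = "<document><metadata><area>GEE,ED/OFFICE; N408</area></metadata></document>"

# fromstring returns the root element, here <document>
root = ET.fromstring(sample)

# The path is relative to the root; findtext returns None if nothing matches
area_text = root.findtext("metadata/area")

print(area_text)
```

For the live document you would pass the downloaded XML to `ET.fromstring` instead of `sample`.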