The Beautiful Soup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters says:

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Wow!&rdquo; he said.", 'html.parser')

str(soup)

'“Wow!” he said.'

Is there any way to change this behavior so that entities like '&ldquo;', '&rdquo;' or '&quot;' are preserved when processing HTML or XML strings with BeautifulSoup?

1 Answer

Did you try reading the rest of that documentation section? You can get the entities back by passing formatter="html" to soup.encode:

>>> soup.encode(formatter="html")
b'&ldquo;Wow!&rdquo; he said.'
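
If a str is more convenient than bytes, the same formatter can also be passed to decode. A minimal sketch, assuming bs4 is installed and reusing the example string from the question (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Wow!&rdquo; he said.", "html.parser")

# Default output: entities have been converted to Unicode characters.
as_unicode = str(soup)

# formatter="html" turns Unicode characters back into named HTML
# entities wherever a named entity exists.
as_bytes = soup.encode(formatter="html")  # bytes
as_text = soup.decode(formatter="html")   # str
```

Note that this re-encodes every character that has a named entity, whether or not it was written as an entity in the input, so it is not a strict "preserve what was there" round trip.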

Another way is to replace & with &amp; before passing the string to BeautifulSoup, so the parser sees the entities as literal text:

>>> html = "&ldquo;Wow!&rdquo; he said."
>>> soup = BeautifulSoup(html.replace("&", "&amp;"), 'html.parser')
>>> print(soup.get_text())
&ldquo;Wow!&rdquo; he said.
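
If you need the markup back out rather than just the text, the escaped ampersands have to be written without entity substitution. A rough sketch, assuming bs4 is installed; the `<p>` wrapper is added here only to show that tags survive:

```python
from bs4 import BeautifulSoup

html = "<p>&ldquo;Wow!&rdquo; he said.</p>"

# Escape every ampersand so the parser treats entities as literal text.
soup = BeautifulSoup(html.replace("&", "&amp;"), "html.parser")

# The text node now contains the entity names verbatim.
text = soup.get_text()

# formatter=None writes strings out unmodified, so the literal
# "&ldquo;" is not escaped a second time on output.
markup = soup.decode(formatter=None)
```

Keep in mind that the replace call touches every ampersand in the input, including ones in attribute values, so this is only safe when the whole document is written back out with formatter=None.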