BeautifulSoup: how to keep HTML entity, &quot;

Question

The documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters says:

If you give Beautiful Soup a document that contains HTML entities like “&lquot;”, they’ll be converted to Unicode characters:

soup = BeautifulSoup("&ldquo;Wow!&rdquo; he said.", 'html.parser')

str(soup)

'“Wow!” he said.'

Is there any way to change this behavior so that BeautifulSoup preserves entities like '&ldquo;', '&rdquo;' or '&quot;' when processing HTML or XML?

orlp · Accepted Answer · 2020-12-07T04:47:27

Did you try reading the rest of that documentation section? You can get the entities back by passing formatter="html" to soup.encode:

>>> soup.encode(formatter="html")
b'&ldquo;Wow!&rdquo; he said.'
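
For reference, here is a minimal self-contained sketch of the same idea, assuming bs4 is installed. The variable names (as_text, as_bytes, pretty) are only for illustration. Besides encode, the decode and prettify methods also take a formatter argument, so you can get a str instead of bytes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("&ldquo;Wow!&rdquo; he said.", "html.parser")

# The same formatter name works on the three output methods.
as_text = soup.decode(formatter="html")    # str
as_bytes = soup.encode(formatter="html")   # bytes (UTF-8 by default)
pretty = soup.prettify(formatter="html")   # str, re-indented

print(as_text)
print(as_bytes)
```

Keep in mind that this does not remember which characters were entities in the input. The formatter re-creates named entities on output for every character that has one, so a literal “ typed directly in the source should come out as &ldquo; as well. Also, as far as I know, a plain double quote inside text is not turned back into &quot; by this formatter, so it may not cover that particular entity.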

Another way is to replace & with &amp; before passing to BeautifulSoup:

>>> html = "&ldquo;Wow!&rdquo; he said."
>>> soup = BeautifulSoup(html.replace("&", "&amp;"), 'html.parser')
>>> print(soup.get_text())
&ldquo;Wow!&rdquo; he said.
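
One thing to watch with this approach is how you write the document back out. A small sketch, assuming bs4 is installed (the names text, default_out and raw_out are just illustrative): the default formatter escapes the ampersands again, while formatter=None leaves strings untouched on output, which should give back the original markup.

```python
from bs4 import BeautifulSoup

html = "&ldquo;Wow!&rdquo; he said."

# Escape every ampersand so the parser sees "&amp;ldquo;" and stores
# the literal text "&ldquo;" instead of the decoded character.
soup = BeautifulSoup(html.replace("&", "&amp;"), "html.parser")

text = soup.get_text()                  # literal entity text
default_out = str(soup)                 # default formatter re-escapes "&"
raw_out = soup.decode(formatter=None)   # no substitution on output

print(text)
print(default_out)
print(raw_out)
```

Note that formatter=None performs no escaping at all, so if you add new strings containing <, > or & to the tree yourself, the output may no longer be valid HTML.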
