Can <script> tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use regular expressions or something else?
102 votes
3 Answers
176 votes
41 votes
Updated answer for future reference: the correct answer is decompose(). There are other ways to do it, but decompose() works in place, removing the tag and its contents from the tree.
Example usage:
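from bs4 import BeautifulSoup  # assumes Beautiful Soup 4
soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()  # removes the <i> tag and its contents in place
print(str(soup))
# prints: <p>This is a slimy text and </p>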
Pretty useful to get rid of detritus like <script>, <img> and so forth.
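Applied to the question itself, removing every <script> tag might look like the sketch below. It assumes the bs4 package and the standard-library html.parser; the helper name strip_tags and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def strip_tags(html, names=("script",)):
    """Return html with every tag listed in `names` (and its contents) removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like result, so mutating the tree while looping is safe
    for tag in soup.find_all(list(names)):
        tag.decompose()
    return str(soup)

html = "<html><body><p>Hello</p><script>alert('hi');</script></body></html>"
print(strip_tags(html))
# prints: <html><body><p>Hello</p></body></html>
```

Passing names=("script", "img") would drop images in the same pass.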
26 votes
As stated in the official documentation, you can use the extract method to remove every subtree that matches the search.
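from bs4 import BeautifulSoup  # assumes Beautiful Soup 4; the original used the older BeautifulSoup 3 module
soup = BeautifulSoup("<html><body><script>aaa</script></body></html>", "html.parser")
# extract() detaches each matching tag from the tree and returns it
for script in soup.find_all('script'):
    script.extract()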
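For comparison with decompose(): extract() hands back the tag it removed, so the removed content can still be inspected afterwards. A small sketch, again assuming bs4 and html.parser, with made-up sample HTML and variable names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>keep</p><script>aaa</script></body>", "html.parser")

# extract() removes the tag from the tree and returns it as a standalone Tag
removed = soup.script.extract()
remaining = str(soup)

print(remaining)       # prints: <body><p>keep</p></body>
print(removed.string)  # prints: aaa
```

If the removed tags are never needed again, decompose() is usually the simpler choice.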