4
votes

I want to write a web application that allows users to enter any HTML that can occur inside a <div> element. This HTML will then end up being displayed to other users, so I want to make sure that the site doesn't open people up to XSS attacks.

Is there a good Python library that strips all event handler attributes, <script> elements and other JavaScript cruft from HTML or a DOM tree?

I intend to use Beautiful Soup to regularize the HTML so that it doesn't contain unclosed tags and the like. But as far as I can tell, it has no pre-packaged way to strip all JavaScript.

If there is a nice library in some other language, that might also work, but I would really prefer Python.

I've done a bunch of Google searching and hunted around on PyPI, but haven't been able to find anything obvious.

Related

5
@J.F. Sebastian, the link is appreciated, and I will leave your edit of my post intact, but I think that it more properly belonged as a comment than a post edit. - Omnifarious
You're right. It is my habit from the days when comments didn't support links and were not googlable. - jfs

5 Answers

5
votes

As Klaus mentions, the clear consensus in the community is to use BeautifulSoup for these tasks:

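from bs4 import BeautifulSoup  # BeautifulSoup 4; the original snippet used the older BeautifulSoup 3 API

soup = BeautifulSoup(html, "html.parser")
# Removes every <script> element; event-handler attributes such as onclick are not touched.
for script_elt in soup.find_all("script"):
    script_elt.decompose()
html = str(soup)

4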
votes

A whitelist of allowed tags, attributes and attribute values is the only reliable approach. Take a look at Recipe 496942: Cross-site scripting (XSS) defense.
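To make the whitelist idea concrete, here is a minimal sketch using only the standard library's html.parser. It is not the code from the recipe mentioned above. The names ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_URL_PREFIXES and sanitize are made up for illustration, and the whitelist itself is an assumption you would tune. Anything not explicitly allowed is dropped, the contents of <script> and <style> are discarded, and text is re-escaped on output. It does not balance tags, so treat it as a starting point rather than a vetted sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist, chosen for illustration only.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
VOID_TAGS = {"br"}
DROP_CONTENT = {"script", "style"}  # drop these tags and everything inside them


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag itself, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # this is where onclick, style, etc. disappear
            value = (value or "").strip()
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue  # rejects javascript: URLs (and relative URLs too)
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>'))
```

The point of the design is that the default is "deny": a new tag or attribute invented by a browser vendor is removed until you decide to allow it, which is the opposite of hunting for known-bad constructs.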

What is wrong with existing markup languages, such as the one used on this very site?

0
votes

You could use BeautifulSoup. It lets you traverse the markup structure fairly easily, even if it's not well-formed. I don't know of anything ready-made that works only on script tags.

0
votes

I would honestly look at using something like BBCode or some other alternative markup instead.
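A rough sketch of what that could look like, assuming a tiny BBCode-like dialect with [b], [i], [u] and [url=...]. The tag set and the function name bbcode_to_html are hypothetical, and real BBCode dialects vary. The key step is that all input is HTML-escaped first, so the only markup in the output is what the converter itself emits.

```python
import re
from html import escape

# Hypothetical tag set for illustration; real BBCode dialects differ.
SIMPLE_TAGS = {"b": "strong", "i": "em", "u": "u"}
# Only http(s) links are accepted, so javascript: URLs never become an href.
URL_RE = re.compile(r"\[url=(https?://[^\]\s]+)\](.*?)\[/url\]", re.S | re.I)


def bbcode_to_html(text):
    # Escape first, so any raw HTML the user typed becomes inert text.
    out = escape(text, quote=True)
    for bb, tag in SIMPLE_TAGS.items():
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (bb, bb), re.S | re.I)
        out = pattern.sub(r"<%s>\1</%s>" % (tag, tag), out)
    return URL_RE.sub(r'<a href="\1">\2</a>', out)


print(bbcode_to_html('[b]hi[/b] <script>alert(1)</script>'))
```

Markup that doesn't match a known pattern is simply left in the text as typed, which is harmless because it has already been escaped.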

0
votes

Eric,

Have you thought about using a SAX-type parser for the HTML? I'm really not sure that it would ignore the events properly, though. It would also be a bit harder to construct than using something like Beautiful Soup. Handling syntax errors may be a problem with SAX as well.

What I like to do in situations like this is to construct Python objects (subclassed from an XML_Element class) from the parsed HTML, then remove any undesired objects from the tree, and finally re-serialize the objects back to HTML. It's not all that hard in Python.
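Roughly, the parse / prune / re-serialize idea could look like the sketch below. This is not my actual XML_Element code: the Element and TreeBuilder classes, the clean function and the allowed tag and attribute sets are all made up for illustration, and it uses the standard library's html.parser to build the tree. Elements that are not on the allowed list are removed together with their contents, and only listed attributes survive.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr"}
# Hypothetical policy for illustration only.
ALLOWED_TAGS = {"p", "br", "hr", "b", "i", "em", "strong", "ul", "ol", "li"}
ALLOWED_ATTRS = {"title"}


class Element:
    def __init__(self, tag, attrs=()):
        self.tag = tag            # None marks the synthetic root
        self.attrs = dict(attrs)
        self.children = []        # Element instances or plain strings

    def prune(self):
        # Keep only listed attributes; onclick and friends vanish here.
        self.attrs = {k: v for k, v in self.attrs.items() if k in ALLOWED_ATTRS}
        # Remove undesired elements, including everything inside them.
        self.children = [c for c in self.children
                         if not isinstance(c, Element) or c.tag in ALLOWED_TAGS]
        for child in self.children:
            if isinstance(child, Element):
                child.prune()

    def to_html(self):
        inner = "".join(c.to_html() if isinstance(c, Element)
                        else escape(c, quote=False) for c in self.children)
        if self.tag is None:
            return inner
        attrs = "".join(' %s="%s"' % (k, escape(v or "", quote=True))
                        for k, v in self.attrs.items())
        if self.tag in VOID_TAGS:
            return "<%s%s>" % (self.tag, attrs)
        return "<%s%s>%s</%s>" % (self.tag, attrs, inner, self.tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Element(None)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Element(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def clean(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    builder.root.prune()
    return builder.root.to_html()


print(clean('<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>'))
```

A side effect of serializing from the tree is that unclosed tags get closed on output, which covers the "regularize the HTML" part of the question as well.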

Regards,