I am saving user-submitted HTML in a database, and I must prevent JavaScript injection attacks. The most pernicious I have seen is JavaScript in a style="expression(...)" attribute.
In addition, a fair amount of valid user content will include special characters and XML constructs, so I'd like to avoid a whitelist approach (listing every allowable HTML element and attribute) if possible.
Examples of JavaScript attack strings:
1. "Hello, I have a <script>alert("bad!")</script> problem with the <dog> element..."
2. "Hi, this <b style="width:expression(alert('bad!'))">dog</b> is black."
Is there a way to prevent such JavaScript, and leave the rest intact?
The only solution I have so far is to use a regular expression to remove certain patterns. It solves case 1, but not case 2.
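To illustrate the limitation, here is a minimal sketch of that kind of regex filter. The pattern and the function name stripScriptTags are hypothetical, not the exact code in use, and it is written in JavaScript only for brevity; the same pattern could be applied on the server with a regex replace.

```javascript
// Hypothetical helper: strips <script>...</script> blocks with a regular expression.
function stripScriptTags(html) {
  // [\s\S]*? matches any characters, including newlines, non-greedily,
  // so each script block is removed separately.
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// The two attack strings from the examples above.
const case1 = `Hello, I have a <script>alert("bad!")</script> problem with the <dog> element...`;
const case2 = `Hi, this <b style="width:expression(alert('bad!'))">dog</b> is black.`;

// Case 1: the script block is removed and the harmless <dog> text survives.
console.log(stripScriptTags(case1));

// Case 2: nothing matches, so the expression(...) in the style attribute passes through.
console.log(stripScriptTags(case2));
```

The second string comes back unchanged because the script lives inside an attribute value, not inside a script element, which is exactly the gap described above.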
The environment is essentially the Microsoft stack:
- SQL Server 2005
- C# 3.0 on .NET 3.5 (ASP.NET)
- JavaScript and jQuery.
I would like the chokepoint to be the ASP.NET layer, since anyone can craft a bad HTTP request.
Edit
Thanks for the links, everyone. Assuming that I can define my list (the content will include many mathematical and programming constructs, so a whitelist is going to be very annoying), I still have a question:
What kind of parser will allow me to remove just the "bad" parts? The bad part could be an entire element, but what about scripts that reside in attributes? I can't remove <a href> attributes willy-nilly.