5
votes

I'm developing a web app where users can respond to blog entries. This is a security problem because they can send dangerous data that will be rendered to other users (and executed as JavaScript).

They can't format the text they send. No "bold", no colors, no nothing. Just simple text. I came up with this regex to solve my problem:

[^\\w\\s.?!()]

So anything that is not a word character (a-z, A-Z, 0-9, _), not whitespace, and not ".", "?", "!", "(" or ")" will be replaced with an empty string. Then every quotation mark will be replaced with "&quot;".
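For reference, the filter described above can be sketched in Python roughly as follows. The function name is hypothetical, and re.ASCII is an assumption made so that \w means only a-z, A-Z, 0-9 and underscore. Note that the double quote is not in the allowed set, so the first step already removes it and the "&quot;" replacement never has anything to act on.

```python
import re

# Everything that is NOT a word character, whitespace, or one of . ? ! ( )
# re.ASCII is an assumption: it keeps \w limited to [a-zA-Z0-9_].
DISALLOWED = re.compile(r"[^\w\s.?!()]", re.ASCII)


def sanitize_comment(text: str) -> str:
    """Hypothetical sketch of the two-step filter described in the question."""
    stripped = DISALLOWED.sub("", text)
    # Second step from the question; quotes were already stripped above,
    # so this replacement is effectively a no-op.
    return stripped.replace('"', "&quot;")


print(sanitize_comment('<script>alert("x")</script>'))  # scriptalert(x)script
print(sanitize_comment("it's 50% off, really!"))        # its 50 off really!
```

The second call shows the side effect of stripping: ordinary punctuation such as apostrophes, commas and percent signs is lost as well.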

I check the data on the front end and I check it on my server.

Is there any way somebody could bypass this "solution"?

I'm wondering how StackOverflow handles this. There is a lot of formatting here, so they must be doing a good job with it.

6
What is your server-side language? - Pekka
You didn't say anything about <>, which are probably the most vital characters used in XSS... - rook

6 Answers

3
votes

If you just want simple text, don't worry about filtering specific HTML tags. You want the equivalent of PHP's htmlspecialchars(). A good way to use this is print htmlspecialchars($var, ENT_QUOTES); This function will perform the following encodings:

'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
''' (single quote) becomes '&#039;' only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'

This solves the problem of XSS at the lowest level, and you don't need some complex library or regex that you don't understand (and that is probably insecure; after all, complexity is the enemy of security).
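Since the question does not name a server-side language, here is a minimal sketch of the same idea in Python, using the standard library's html.escape. The function name render_comment is hypothetical. With quote=True, html.escape encodes both quote characters, which is roughly comparable to ENT_QUOTES, although the single quote comes out as &#x27; rather than &#039;.

```python
import html


def render_comment(user_text: str) -> str:
    """Escape user-supplied text before writing it into an HTML page."""
    # quote=True also encodes " and ', in addition to &, < and >.
    return html.escape(user_text, quote=True)


print(render_comment('<script>alert("xss")</script>'))
# &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```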

Make sure to TEST YOUR XSS FILTER by running a free xss scanner.

2
votes

I agree with Tomalak, and just wanted to add a few points.

  1. Don't allow HTML tags. The idea is to treat user input as text, and HTML-escape characters before rendering them. Use OWASP's ESAPI project for this purpose. This page explains the various possible encodings that you should be aware of.
  2. If you have to allow HTML tags, use a library to do the filtering for you. DO NOT write your own regexes; they are difficult to get right. Use OWASP's Anti-Samy project - it was designed specifically for this use case.
2
votes

I'd recommend reading the XSS Prevention Cheat Sheet, which details best practices for avoiding XSS attacks. Essentially, what you need to filter depends on the context in which the data will be used.

For example, in this kind of scenario:

<body>...ESCAPE UNTRUSTED DATA BEFORE PUTTING HERE...</body>

You need to do:

& --> &amp;
< --> &lt;
> --> &gt;
" --> &quot;
' --> &#x27;     &apos; is not recommended
/ --> &#x2F;     forward slash is included as it helps end an HTML entity
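The table above can be sketched as a single-pass translation in Python. This is only an illustration of the mapping; the names HTML_BODY_ESCAPES and escape_html_body are hypothetical. Using str.translate means each input character is mapped exactly once, so the "&" inside an emitted entity is not escaped a second time.

```python
# Mapping taken directly from the table above.
HTML_BODY_ESCAPES = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "/": "&#x2F;",
})


def escape_html_body(untrusted: str) -> str:
    """Escape untrusted data for placement inside HTML element content."""
    return untrusted.translate(HTML_BODY_ESCAPES)


print(escape_html_body("</p><script>"))  # &lt;&#x2F;p&gt;&lt;script&gt;
```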

In the case of an href="" attribute, by contrast, you need to URL-escape the data:

"Except for alphanumeric characters, escape all characters with ASCII values less than 256 with the %HH escaping format. Including untrusted data in data: URLs should not be allowed as there is no good way to disable attacks with escaping to prevent switching out of the URL. All attributes should be quoted. Unquoted attributes can be broken out of with many characters including [space] % * + , - / ; < = > ^ and |. Note that entity encoding is useless in this context."

The cited article gives the full details, but hopefully there's enough information in this answer to get you started.
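As a rough illustration of the URL-escaping rule quoted above, the following Python sketch percent-encodes every byte that is not an ASCII letter or digit. The helper name percent_encode and the /search?q= path are hypothetical. The standard library's urllib.parse.quote does something similar but leaves a few extra characters ("_", ".", "-", "~") unescaped, so a stricter hand-written loop is used here.

```python
def percent_encode(untrusted: str) -> str:
    """Escape everything except ASCII alphanumerics using the %HH format."""
    out = []
    for byte in untrusted.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


# Hypothetical usage: the value goes into a quoted attribute, as a URL parameter.
query = '"><script>'
link = f'<a href="/search?q={percent_encode(query)}">results</a>'
print(link)  # <a href="/search?q=%22%3E%3Cscript%3E">results</a>
```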

1
votes
  1. Don't allow HTML tags.
  2. Don't output anything a user entered without HTML-escaping it first. This is the much more important point! Do this and you will not ever have an XSS problem.
  3. Provide a preview function so users can see what it will look like before posting.

If you must allow HTML tags, define a whitelist and check user input against it. You can even use regex for this.

Say you allow <p>, <a href="..."> and <img src="...">:

  1. find everything in the user string that matches <\S[^>]*>
  2. for every match, check it against <(p|a href="[^"]+"|img src="[^"]+")/?>|</(a|p)>
  3. if it does not fit that rigorous regex, throw it away.
  4. See point #2 above.
  5. Try hard to deliberately break your system. Ask others to try and break your system.
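The steps above could be sketched in Python roughly like this. It is an illustration under stated assumptions, not a vetted sanitizer: the function name filter_html is hypothetical, and the whitelist pattern is copied as-is from step 2, so it does not restrict which URL schemes may appear in href or src. Text between tags is HTML-escaped, which is how "point #2" is interpreted here.

```python
import html
import re

# Step 1: anything that looks like a tag (capturing group so split() keeps it).
TAG = re.compile(r'(<\S[^>]*>)')
# Step 2: the rigorous whitelist from the answer, unchanged.
ALLOWED = re.compile(r'<(p|a href="[^"]+"|img src="[^"]+")/?>|</(a|p)>')


def filter_html(user_text: str) -> str:
    """Keep whitelisted tags, drop other tags, escape everything else."""
    out = []
    # split() with one capturing group alternates: text, tag, text, tag, ...
    for index, part in enumerate(TAG.split(user_text)):
        if index % 2 == 1:
            if ALLOWED.fullmatch(part):
                out.append(part)  # whitelisted tag, kept verbatim
            # Step 3: anything else that looked like a tag is thrown away.
        else:
            out.append(html.escape(part, quote=True))  # step 4: escape text
    return "".join(out)


print(filter_html('<p>Hi <b>there</b></p><script>alert(1)</script>'))
# <p>Hi there</p>alert(1)
```

Step 5 still applies: a sketch like this should be attacked deliberately before it is trusted with real input.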
0
votes

The front end can be bypassed, for instance by using Fiddler to alter the form data. On the back end, use HTML encoding, e.g. <a> becomes &lt;a&gt;

This way the text will be displayed as text not html elements.

0
votes

Remove any bad character sequences first, e.g. overlong UTF-8, invalid Unicode.

You'll need to be more explicit about whether < and > are stripped or turned into entities.

You'll also need to strip or encode double and single quotes, otherwise an attacker can add an intrinsic event where you didn't expect, e.g. <input name='comment' value='foo'onSomething=payload;a=''>
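The attribute breakout above can be reproduced with a small Python sketch. The render_input helper is hypothetical and exists only to show the difference between inserting the value raw and inserting it after encoding quotes with html.escape.

```python
import html


def render_input(value: str, escape: bool) -> str:
    """Build the example <input> tag, optionally escaping the value first."""
    if escape:
        value = html.escape(value, quote=True)  # encodes ' and " as well
    return f"<input name='comment' value='{value}'>"


payload = "foo'onSomething=payload;a='"

print(render_input(payload, escape=False))
# <input name='comment' value='foo'onSomething=payload;a=''>
print(render_input(payload, escape=True))
# <input name='comment' value='foo&#x27;onSomething=payload;a=&#x27;'>
```

In the first output the single quote closes the value attribute early and the rest is parsed as a new attribute; in the second the whole payload stays inside the attribute value.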

If you really want to allow some subset of HTML, be careful trying to parse it with regexes, especially ones you come up with yourself. For example, browsers will render tricky tags such as <a b=">"onMouseOver=alert(42)> just fine, where a regex might mismatch them. Check out the previously mentioned Anti-Samy.

If you're allowing HTML tags that have href or src attributes, make sure they point to http(s): schemes, not javascript: ones.
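A minimal sketch of such a scheme check in Python, using urllib.parse.urlsplit from the standard library. The name is_safe_url is hypothetical, and the check is deliberately an allowlist: anything that is not an absolute http or https URL is rejected, including relative URLs, which a real application may want to handle separately.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_safe_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs."""
    try:
        scheme = urlsplit(url.strip()).scheme.lower()
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs; reject those.
        return False
    return scheme in ALLOWED_SCHEMES


print(is_safe_url("https://example.com/a.png"))  # True
print(is_safe_url("JavaScript:alert(1)"))        # False
```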