6
votes

In HTML, there are several special characters (< > & ' ") which have significance to the HTML parser. These are the characters that popular functions such as PHP's htmlspecialchars convert to HTML entities so they don't accidentally trigger something when parsed.

The translations performed are:

  • '&' (ampersand) becomes &amp;
  • " (double quote) becomes &quot; when ENT_NOQUOTES is not set.
  • ' (single quote) becomes &#039; only when ENT_QUOTES is set.
  • '<' (less than) becomes &lt;
  • '>' (greater than) becomes &gt;
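As a rough illustration of the list above, here is a minimal Python sketch of the same five translations (the case where ENT_QUOTES is set). The helper name `escape_html_special` is made up for this example; it is a stand-in for the behaviour described, not PHP's implementation.

```python
# Hypothetical helper mirroring the five translations listed above
# (htmlspecialchars with ENT_QUOTES); this is not PHP's actual code.
_TRANSLATIONS = str.maketrans({
    "&": "&amp;",
    '"': "&quot;",
    "'": "&#039;",
    "<": "&lt;",
    ">": "&gt;",
})


def escape_html_special(text: str) -> str:
    """Replace the five HTML-significant characters with entities."""
    # translate() works in a single pass, so the "&" inside a freshly
    # inserted entity is never escaped a second time.
    return text.translate(_TRANSLATIONS)


print(escape_html_special("<a href=\"x\">Tom & 'Jerry'</a>"))
# prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; &#039;Jerry&#039;&lt;/a&gt;
```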

However, I remember that in older browsers like IE6, there were also other byte sequences that caused the browser's DOM parser to interpret content as HTML.

Is this still a problem today? If you filter these 5 alone, is that enough to prevent XSS?

For example, here are all the known encodings of the character "<" in HTML and JavaScript (in UTF-8).

<
%3C
&lt
&lt;
&LT
&LT;
&#60
&#060
&#0060
&#00060
&#000060
&#0000060
&#60;
&#060;
&#0060;
&#00060;
&#000060;
&#0000060;
&#x3c
&#x03c
&#x003c
&#x0003c
&#x00003c
&#x000003c
&#x3c;
&#x03c;
&#x003c;
&#x0003c;
&#x00003c;
&#x000003c;
&#X3c
&#X03c
&#X003c
&#X0003c
&#X00003c
&#X000003c
&#X3c;
&#X03c;
&#X003c;
&#X0003c;
&#X00003c;
&#X000003c;
&#x3C
&#x03C
&#x003C
&#x0003C
&#x00003C
&#x000003C
&#x3C;
&#x03C;
&#x003C;
&#x0003C;
&#x00003C;
&#x000003C;
&#X3C
&#X03C
&#X003C
&#X0003C
&#X00003C
&#X000003C
&#X3C;
&#X03C;
&#X003C;
&#X0003C;
&#X00003C;
&#X000003C;
\x3c
\x3C
\u003c
\u003C
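
For what it's worth, most of the spellings above are different encodings of the same code point, U+003C. A small Python sketch shows a sample of them collapsing to one character. Python's `html.unescape` and `urllib.parse.unquote` are used here only as stand-ins for a browser's entity and URL decoding, which is an assumption made for illustration.

```python
import html
from urllib.parse import unquote

# A sample of the spellings listed above (not the full list).
entity_forms = ["&lt", "&lt;", "&LT;", "&#60", "&#0000060;", "&#x3c", "&#X003C;"]

# HTML entity decoding: every form yields the single character "<".
decoded = {html.unescape(form) for form in entity_forms}
print(decoded)

# URL percent-decoding handles the %3C form.
print(unquote("%3C"))

# \x3c and \u003c are JavaScript string escapes; Python string literals
# happen to share the same syntax, so they can be checked directly.
print("\x3c" == "<", "\u003c" == "<")
```

None of these forms is a raw "<" in the page source; each one only turns into "<" after some layer decodes it.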

3 Answers

4
votes

No. I actually looked into this when I was researching using CSS and attributes to automatically assign styles based on content (my question), and the short answer is no. Modern browsers do not allow 'byte sequences' to be used as HTML. I use the term 'byte sequences' loosely, because the code most at risk does not use byte-encoded values.

The examples listed on the XSS site are about using attributes and having the JavaScript interpreted as a string that then gets executed. Also listed are things like &{alert('XSS')}, which runs the code within the braces, but that does not work in modern browsers.

But to answer your second question: no, filtering those 5 is not enough to prevent an XSS attack. Always run your input through PHP's htmlspecialchars, but there are hundreds of byte codes that can be used, and you won't really be able to guarantee anything. Sending it through a PHP filter (especially htmlentities()) will give you the exact text entered when you output it to HTML (i.e. &laquo; in the source instead of a raw «). That said, depending on how you will be using the input, htmlspecialchars is enough to cover most attacks, and for the most part it will be safe.

XSS is a tricky thing to account for. A good general rule is to always filter everything that a user enters, and to use white-listing instead of black-listing. What you're talking about here would be black-listing these values, when it is always safer to assume your users are malicious and only allow certain things.

1
votes

Here is an example:

<button onclick="confirm('Are you sure you want to delete &#39;);alert(&#39;xss')">

Here the attacker's input is what comes after "delete " and before the final ')">.

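This escaping will not work in this case, because we escaped for the wrong context. The browser decodes &#39; back to ' when it reads the attribute value, so the JavaScript engine sees the attacker's quotes intact.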

In short, XSS prevention means escaping for the given context. In the above example we are in a JavaScript context within an HTML attribute context. See the OWASP XSS prevention cheat sheet.
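
To make the context problem concrete, here is a minimal Python sketch. It assumes `html.escape` as a stand-in for htmlspecialchars, `html.unescape` as a stand-in for the browser decoding the attribute value, and `json.dumps` as a stand-in for json_encode; the variable names are invented for the example.

```python
import html
import json

payload = "');alert('xss"  # attacker-controlled text

# HTML-escaping alone: the quotes become entities in the page source...
escaped = html.escape(payload, quote=True)
attr_value = "confirm('Are you sure you want to delete " + escaped + "')"

# ...but the browser decodes the attribute value before running it as
# JavaScript (html.unescape stands in for that step here).
js_seen_by_browser = html.unescape(attr_value)
print(js_seen_by_browser)
# prints: confirm('Are you sure you want to delete ');alert('xss')

# Escaping for both contexts: JavaScript string literal first,
# then HTML attribute escaping around the whole expression.
js_literal = json.dumps(payload)
safe_attr = html.escape(
    "confirm('Are you sure you want to delete ' + " + js_literal + ")",
    quote=True,
)
safe_js = html.unescape(safe_attr)
print(safe_js)
# prints: confirm('Are you sure you want to delete ' + "');alert('xss")
```

In the second case the payload stays inside a double-quoted JavaScript string after decoding, so it is displayed rather than executed.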

1
votes

Escaping suffices for text placed in ordinary HTML content, but there are contexts in HTML where even escaped text is dangerous:

  • Don't allow users to create arbitrary URLs (in <a>, <img>, etc.), as they can insert javascript: or one of its many variations. Whitelist only URLs matching ^https?://.

  • HTML-escaping doesn't suffice in <script> (it doesn't use entity-escaping anyway) or in attributes that execute a script (onclick, etc.). For those you need json_encode(), followed by HTML-escaping when the result goes into an attribute.
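
Both points can be sketched in a few lines of Python. The function names `is_allowed_url` and `js_literal_for_script` are hypothetical, and `json.dumps` is used as a stand-in for json_encode(). One difference worth knowing: `json.dumps` leaves "<", ">" and "&" untouched, so the sketch escapes them by hand to keep a "</script>" in the data from ending the block.

```python
import json
import re

# Allow-list from the first bullet: only http:// and https:// URLs pass.
SAFE_URL = re.compile(r"^https?://")


def is_allowed_url(url: str) -> bool:
    """Return True only for URLs that start with http:// or https://."""
    return SAFE_URL.match(url) is not None


def js_literal_for_script(value) -> str:
    """JSON-encode a value for embedding inside a <script> block."""
    # Rewrite "<", ">" and "&" as \uXXXX escapes so the HTML parser never
    # sees a literal "</script>" or "<!--" inside the data.
    return (
        json.dumps(value)
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
        .replace("&", "\\u0026")
    )


print(is_allowed_url("https://example.com/"))   # True
print(is_allowed_url("javascript:alert(1)"))    # False
print(js_literal_for_script("</script><script>alert(1)"))
```

Usage would look something like `"<script>var name = " + js_literal_for_script(user_name) + ";</script>"`, with the URL check applied before any user-supplied value is written into an href or src attribute.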