Are there other sequences browsers interpret as HTML special characters?

Question

In HTML, there are several special characters (< > & ' ") that have significance to the DOM parser. These are the characters that popular functions such as PHP's htmlspecialchars convert to HTML entities so they don't accidentally trigger something when parsed.

The translations performed are:

'&' (ampersand) becomes '&amp;'

'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.

"'" (single quote) becomes '&#039;' only when ENT_QUOTES is set.

'<' (less than) becomes '&lt;'

'>' (greater than) becomes '&gt;'
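
As a rough illustration of these translations, here is a minimal sketch in Python. The standard-library `html.escape` is used as a stand-in for PHP's htmlspecialchars, which is an assumption made for illustration only. The two differ in details: for example, Python emits `&#x27;` for the single quote where PHP emits `&#039;`.

```python
import html

# Sample markup containing all five special characters.
raw = """<a href="x" title='y'>Tom & Jerry</a>"""

# quote=True (the default) also escapes " and ', roughly like ENT_QUOTES in PHP.
escaped_all = html.escape(raw, quote=True)

# quote=False leaves both quote characters alone, roughly like ENT_NOQUOTES.
escaped_no_quotes = html.escape(raw, quote=False)

print(escaped_all)
print(escaped_no_quotes)
```

In both results the `<`, `>` and `&` characters are gone from the output, so the string can no longer open a tag or start an entity when it is placed in element content.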

However, I remember that in older browsers like IE6, there were also other byte sequences that caused the browser's DOM parser to interpret content as HTML.

Is this still a problem today? If you filter these 5 alone, is that enough to prevent XSS?

For example, here are all the known combinations of the character "<" in HTML and JavaScript (in UTF-8).

<
%3C
&lt
&lt;
&LT
&LT;
&#60
&#060
&#0060
&#00060
&#000060
&#0000060
&#60;
&#060;
&#0060;
&#00060;
&#000060;
&#0000060;
&#x3c
&#x03c
&#x003c
&#x0003c
&#x00003c
&#x000003c
&#x3c;
&#x03c;
&#x003c;
&#x0003c;
&#x00003c;
&#x000003c;
&#X3c
&#X03c
&#X003c
&#X0003c
&#X00003c
&#X000003c
&#X3c;
&#X03c;
&#X003c;
&#X0003c;
&#X00003c;
&#X000003c;
&#x3C
&#x03C
&#x003C
&#x0003C
&#x00003C
&#x000003C
&#x3C;
&#x03C;
&#x003C;
&#x0003C;
&#x00003C;
&#x000003C;
&#X3C
&#X03C
&#X003C
&#X0003C
&#X00003C
&#X000003C
&#X3C;
&#X03C;
&#X003C;
&#X0003C;
&#X00003C;
&#X000003C;
\x3c
\x3C
\u003c
\u003C
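
The forms in this list are not decoded by one mechanism. The `&...` forms are HTML character references, `%3C` is URL percent-encoding, and the backslash forms are JavaScript string escapes. A small sketch of that grouping, using Python standard-library decoders as stand-ins (an assumption: `html.unescape` approximates an HTML5 parser's character-reference handling, and Python's `unicode_escape` codec approximates JavaScript string escapes for these particular inputs):

```python
import html
from urllib.parse import unquote

# A sample of the forms listed above, grouped by the layer that decodes them.
html_refs = ["&lt", "&lt;", "&LT;", "&#60", "&#0000060;", "&#x3c", "&#X00003C;"]
url_forms = ["%3C"]
js_escapes = [r"\x3c", r"\u003C"]

# HTML character references (the trailing semicolon is optional for these).
decoded_html = [html.unescape(s) for s in html_refs]

# URL percent-encoding.
decoded_url = [unquote(s) for s in url_forms]

# Backslash escapes, decoded with Python's codec as a stand-in for JavaScript.
decoded_js = [s.encode("ascii").decode("unicode_escape") for s in js_escapes]

print(decoded_html, decoded_url, decoded_js)
```

Each form only turns into `<` inside the layer that understands it. A character reference decoded by the HTML parser becomes text, not the start of a tag.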

LoveAndCoding · Accepted Answer · 2011-12-24T19:15:39

No. I actually looked into this when I was researching using CSS and attributes to automatically assign styles based on content (my question), and the short answer is no. Modern browsers do not allow 'byte sequences' to be used as HTML. I use the term 'byte sequences' loosely, because the most at-risk code does not use byte-encoded values.

The examples listed on the XSS site are mostly about using attributes and having the JavaScript interpreted as a string that then gets executed. Also listed are things like &{alert('XSS')}, which runs the code within the braces, but that code does not work in modern browsers.

But to answer your second question: no, filtering those 5 is not enough to prevent an XSS attack. Always run your input through PHP's htmlspecialchars, but there are hundreds of byte codes that can be used, and you won't really be able to guarantee anything. Sending it through a PHP filter (especially htmlentities()) will give you the exact text entered when you output it to HTML (i.e. &laquo; instead of «). That said, in most cases, depending on your usage, htmlspecialchars is enough to cover most attacks. It depends on how you will be using the input, but for the most part it will be safe.
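
One way to see why the output context matters: a value that contains none of the five characters passes through escaping unchanged. The sketch below again uses Python's `html.escape` as a stand-in for htmlspecialchars, and the `user_url` value and link template are hypothetical.

```python
import html

# Hypothetical user-supplied value destined for an href attribute.
user_url = "javascript:alert(1)"

# Escape it the same way as any other text.
escaped = html.escape(user_url, quote=True)

# None of the five characters occur, so escaping changes nothing and the
# javascript: URL lands in the attribute intact.
link = '<a href="{}">profile</a>'.format(escaped)

print(link)
```

Escaping keeps the value from breaking out of the attribute, but it does not make the value itself safe for that attribute, so input used as a URL needs its own check (for example, on the scheme).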

XSS is a tricky thing to account for. A good general rule is to always filter everything a user enters, and to use white-listing instead of black-listing. What you're talking about here would be black-listing these values, when it is always safer to assume your users are malicious and only allow certain things.
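
A minimal sketch of the white-listing idea, assuming a hypothetical username field that may only contain ASCII letters, digits and underscores (the rule and the function name `is_allowed_username` are illustrative and not taken from any library):

```python
import re

# Hypothetical rule: usernames are 3-20 ASCII letters, digits, or underscores.
ALLOWED_USERNAME = re.compile(r"[A-Za-z0-9_]{3,20}")


def is_allowed_username(value: str) -> bool:
    """Accept only values that match the allow-list pattern in full."""
    return ALLOWED_USERNAME.fullmatch(value) is not None


print(is_allowed_username("love_and_coding"))
print(is_allowed_username("<script>"))
```

Anything outside the allowed set is rejected without the code needing to know which characters are dangerous, which is the advantage over enumerating bad sequences.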
