In HTML, there are several special characters (< > & ' ") which have significance to the DOM parser. These are the characters that popular functions such as PHP's htmlspecialchars convert to HTML entities so that they don't accidentally trigger something when parsed.
The translations performed are:
- '&' (ampersand) becomes '&amp;'
- '"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
- "'" (single quote) becomes '&#039;' only when ENT_QUOTES is set.
- '<' (less than) becomes '&lt;'
- '>' (greater than) becomes '&gt;'
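To make the table above concrete, here is a minimal Python sketch of the same five translations. It is not PHP's implementation: the function name `escape_html_special` and the flag values are hypothetical stand-ins that only borrow the names of PHP's constants, and the single-quote entity is assumed to be the `&#039;` form listed above.

```python
# Hypothetical re-implementation of the five translations listed above.
# The flag names mirror PHP's constants but are plain Python values here.
ENT_COMPAT = "compat"      # escape double quotes only
ENT_QUOTES = "quotes"      # escape both double and single quotes
ENT_NOQUOTES = "noquotes"  # leave both kinds of quote alone


def escape_html_special(text: str, flags: str = ENT_COMPAT) -> str:
    # '&' must be handled first, otherwise the '&' inside the entities
    # produced below would be escaped a second time.
    out = text.replace("&", "&amp;")
    out = out.replace("<", "&lt;").replace(">", "&gt;")
    if flags != ENT_NOQUOTES:
        out = out.replace('"', "&quot;")
    if flags == ENT_QUOTES:
        out = out.replace("'", "&#039;")
    return out


if __name__ == "__main__":
    print(escape_html_special("<a href='x'>Tom & \"Jerry\"</a>", ENT_QUOTES))
```

The order of the replacements matters: escaping '&' last would turn an already produced `&lt;` into `&amp;lt;`.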
However, I remember that in older browsers like IE6, there were also other byte sequences that caused the browser's DOM parser to interpret content as HTML.
Is this still a problem today? If you filter only these five characters, is that enough to prevent XSS?
For example, here are all the known combinations of the character "<" in HTML and JavaScript (in UTF-8).
<
%3C
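&lt
&lt;
&LT
&LT;
&#60
&#060
&#0060
&#00060
&#000060
&#0000060
&#60;
&#060;
&#0060;
&#00060;
&#000060;
&#0000060;
&#x3c
&#x03c
&#x003c
&#x0003c
&#x00003c
&#x000003c
&#x3c;
&#x03c;
&#x003c;
&#x0003c;
&#x00003c;
&#x000003c;
&#X3c
&#X03c
&#X003c
&#X0003c
&#X00003c
&#X000003c
&#X3c;
&#X03c;
&#X003c;
&#X0003c;
&#X00003c;
&#X000003c;
&#x3C
&#x03C
&#x003C
&#x0003C
&#x00003C
&#x000003C
&#x3C;
&#x03C;
&#x003C;
&#x0003C;
&#x00003C;
&#x000003C;
&#X3C
&#X03C
&#X003C
&#X0003C
&#X00003C
&#X000003C
&#X3C;
&#X03C;
&#X003C;
&#X0003C;
&#X00003C;
&#X000003C;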
\x3c
\x3C
\u003c
\u003C
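As a rough sanity check of the list, the sketch below feeds a sample of these spellings through the decoder that matches each context: Python's `html.unescape` for the character references, `urllib.parse.unquote` for the percent-encoded form, and a small hypothetical helper, `decode_js_escape`, for the JavaScript string escapes. It only shows that each spelling names the same character once it has been decoded in its own context. It does not show how any particular browser treats them.

```python
import html
from urllib.parse import unquote


def decode_js_escape(literal: str) -> str:
    # Interpret \xNN and \uNNNN roughly as a JavaScript string literal would;
    # Python's "unicode_escape" codec accepts the same syntax for these forms.
    return literal.encode("ascii").decode("unicode_escape")


# A sample from each family in the list above.
html_forms = ["&lt", "&LT;", "&#60", "&#0000060;", "&#x3c", "&#X00003C;"]
url_forms = ["%3C"]
js_forms = [r"\x3c", r"\x3C", r"\u003c", r"\u003C"]

decoded = (
    [html.unescape(s) for s in html_forms]
    + [unquote(s) for s in url_forms]
    + [decode_js_escape(s) for s in js_forms]
)

# Check whether every sample decoded to the same single character.
print(all(ch == "<" for ch in decoded))
```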