HTML 4 states pretty clearly which characters should be escaped:

Four character entity references deserve special mention since they are frequently used to escape special characters:

  • "&lt;" represents the < sign.
  • "&gt;" represents the > sign.
  • "&amp;" represents the & sign.
  • "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:

Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.

Could somebody point to the official source on this matter?

3
Characters need to be escaped when ambiguous. So, " in double-quoted attributes and ' in single-quoted attributes (obviously ambiguous), plus < in text outside of attributes (only ambiguous sometimes, but still causes validation errors). <b>2 > 1</b> is valid HTML5. & is also an error when ambiguous. – Ry-♦
Thanks, but... I still feel this all makes sense but there is no normative section regarding it. HTML is not, after all, very inviting to "makes sense" guidance. (Say, <p> could unambiguously close all the open <em> and <strong> tags of the previous paragraph, etc.) Why this omission, while devoting time to "unless the first thing of the element is a comment"? It feels like a major oversight. – ezequiel-garzon
I’m not sure what that has to do with escaping rules, but automatically correcting unclosed tags to some sort of recognizable tree needs to exist for historical reasons. – Ry-♦
I meant with my example that I wasn't looking for convincing reasons or common sense, since HTML has (say, unlike XML) a high degree of arbitrariness. Instead, I was looking for a source, which you kindly provided. Thanks again. – ezequiel-garzon
Does this answer your question? Which characters need to be escaped in HTML? – Dan Dascalescu

3 Answers

9
votes

The specification defines the syntax for normal elements as:

Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.

So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
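As a rough illustration of these rules, Python's standard `html.escape` can handle both contexts. It escapes more than the specification strictly requires (it also replaces `>`), which is harmless. The helper names `escape_text` and `escape_attr` are made up for this sketch.

```python
import html

def escape_text(s: str) -> str:
    # Element content: only & and < are required; html.escape also replaces >.
    return html.escape(s, quote=False)

def escape_attr(s: str) -> str:
    # Quoted attribute value: additionally replace " and ' so that
    # either quoting style can be used around the value.
    return html.escape(s, quote=True)

print("<p>" + escape_text("2 > 1 & 1 < 2") + "</p>")
print('<a title="' + escape_attr('say "hi" & go') + '">link</a>')
```

Escaping every ampersand, as this does, sidesteps the question of whether a given ampersand is ambiguous.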

These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
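A minimal sketch of that JSON post-processing step in Python, assuming the result is written into an inline <script>. The function name `json_for_script` is hypothetical, and the sketch uses `\u003c` instead of `\x3c` so the output stays valid JSON as well as valid JavaScript (JSON has no `\x` escape).

```python
import json

def json_for_script(value) -> str:
    # Serialize first; ensure_ascii=False keeps non-ASCII characters literal,
    # so U+2028 and U+2029 have to be handled explicitly below.
    text = json.dumps(value, ensure_ascii=False)
    # Inside JSON, "<" can only occur within string literals, so replacing it
    # with an escape sequence does not change the decoded value.
    return (
        text.replace("<", "\\u003c")
            .replace("\u2028", "\\u2028")
            .replace("\u2029", "\\u2029")
    )

payload = {"comment": "</script><script>alert(1)</script>"}
print("<script>var data = " + json_for_script(payload) + ";</script>")
```

The escaped text decodes back to the original value, so the consumer of the data needs no matching unescape step.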

5
votes

From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments

Escaping a string (for the purposes of the algorithm* above) consists of running the following steps:

  1. Replace any occurrence of the "&" character by the string "&amp;".
  2. Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
  3. If the algorithm was invoked in the attribute mode, replace any occurrences of the '"' character by the string "&quot;".
  4. If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".

*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.

Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:

  1. The & character should be replaced by &amp;
  2. Non-breaking spaces should be escaped as &nbsp; (surprise!...)
  3. Within attributes, " should be escaped as &quot;
  4. Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;

I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
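The four quoted steps translate almost directly into code. This is only a sketch of the escaping step, not of the full serialization algorithm; the name `escape_string` and the `attribute_mode` parameter are my own.

```python
def escape_string(s: str, attribute_mode: bool = False) -> str:
    # Step 1 must run first, otherwise the ampersands introduced by the
    # later replacements would themselves be escaped again.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: only the double quote is escaped inside attribute values.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: only < and > are escaped outside attribute values.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s

print(escape_string("a < b & c"))
print(escape_string('say "hi" <now>', attribute_mode=True))
```

Note that in attribute mode `<` and `>` are left alone, and outside attribute mode `"` is left alone.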

3
votes

Adding my voice to insist that things are not that easy -- strictly speaking:

Case 1 : HTML serialization

(the most common)

If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."

An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"

Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."

So, in that case, editable && copy (notice the spaces around &&) is a valid construction in HTML5 serialized as HTML, as neither ampersand is followed by a letter.

As a counterexample, editable&&copy is not safe (even if it might work), as the trailing sequence &copy might be interpreted as the entity reference for ©.
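One way to see the difference is Python's `html.unescape`, which follows the HTML5 character reference rules for text, including the legacy names that are recognized without a trailing semicolon. This is just a quick check, not a validator.

```python
import html

# Ampersands followed by "&" or a space cannot start a character reference.
spaced = html.unescape("editable && copy")

# Here the second ampersand is followed by "copy", a legacy entity name
# that is recognized even without the closing semicolon.
fused = html.unescape("editable&&copy")

print(spaced)
print(fused)
```

The first string comes back unchanged, while in the second `&copy` is decoded to ©.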

Case 2 : XML serialization

(the less common)

Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.

In that case, && (with or without spaces) is invalid XML. You should write &amp;&amp; instead.
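This is easy to confirm with any XML parser. The sketch below uses Python's standard library; `is_well_formed` is a helper name introduced only for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(fragment: str) -> bool:
    # ET.fromstring raises ParseError when the input is not well-formed XML.
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

raw = "editable && copy"
print(is_well_formed("<p>" + raw + "</p>"))          # bare ampersands
print(is_well_formed("<p>" + escape(raw) + "</p>"))  # escaped as &amp;&amp;
```

`xml.sax.saxutils.escape` replaces `&`, `<` and `>`; for attribute values, `quoteattr` from the same module also takes care of the quotes.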

Tricky, isn't it?