- convert markdown to html
- sanitize html (w/whitelist)
- insert into database
Here, the assumptions are:
- Given dangerous HTML, the sanitizer can produce safe HTML.
- The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
- sanitize markdown (remove all tags - no exceptions)
- convert to html
- insert into database
Here, the assumptions are:
- Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
- The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but also how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong, than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert (full-width '<', exotic white-space characters stripped by markdown, SCRIPT) into a <script> tag.
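To make the full-width '<' case concrete, here is a minimal Python sketch. Both functions are hypothetical stand-ins, not real library APIs: strip_tags is a naive "remove all tags" pass, and buggy_converter pretends to be a markdown->HTML program that happens to apply Unicode compatibility normalization (NFKC) to its input. No real converter is claimed to behave this way; the point is only that the markdown sanitizer cannot know.

```python
import re
import unicodedata

# Naive "remove all tags" pass: only recognises ASCII angle brackets.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(markdown_text: str) -> str:
    """Hypothetical markdown sanitizer: delete anything that looks like a tag."""
    return TAG_RE.sub("", markdown_text)


def buggy_converter(markdown_text: str) -> str:
    """Hypothetical markdown->HTML step that normalizes its input (assumed bug).

    NFKC maps the full-width '<' (U+FF1C) and '>' (U+FF1E) to ASCII '<' and '>'.
    """
    return "<p>" + unicodedata.normalize("NFKC", markdown_text) + "</p>"


# Payload written with full-width angle brackets, so it contains no ASCII tags.
payload = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

cleaned = strip_tags(payload)      # nothing is removed: no ASCII '<' to match
html = buggy_converter(cleaned)    # the converter turns it into a real tag

print(cleaned == payload)
print(html)
```

The sanitizer ran first and saw nothing to remove, so the dangerous tag only came into existence after the last check. Sanitizing the converter's HTML output, as in the first approach, would at least get to look at the real <script> tag.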
The most secure approach would be:
- sanitize markdown (remove all tags - no exceptions)
- convert markdown to HTML
- sanitize HTML
- insert into a DB column marked risky
- re-sanitize HTML every time you fetch that column from the DB
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can get pretty good security by storing a timestamp with each piece of inserted HTML, so that you can tell which rows might have been inserted during the time when someone knew about an attack that gets past your sanitizer.
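One way the timestamp idea could look, as a rough Python sketch. Everything here is made up for illustration: StoredHtml, fetch_checked, and toy_sanitize are hypothetical names, and the rule "re-sanitize anything last sanitized before the sanitizer was last updated" is a simplifying assumption. toy_sanitize is a placeholder for a real whitelist sanitizer and is not safe to use.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class StoredHtml:
    """Hypothetical DB row: the HTML plus when it last went through the sanitizer."""
    html: str
    sanitized_at: datetime


def fetch_checked(
    row: StoredHtml,
    sanitize: Callable[[str], str],
    sanitizer_updated_at: datetime,
) -> StoredHtml:
    """Return the row as-is if it was sanitized by the current sanitizer,
    otherwise re-sanitize it and stamp it with the current time."""
    if row.sanitized_at >= sanitizer_updated_at:
        return row
    return StoredHtml(html=sanitize(row.html),
                      sanitized_at=datetime.now(timezone.utc))


def toy_sanitize(unsafe_html: str) -> str:
    """Placeholder only: a real whitelist-based sanitizer would go here."""
    return unsafe_html.replace("<script>", "").replace("</script>", "")


# Assumed date on which the sanitizer was last updated to handle a new attack.
updated_at = datetime(2024, 6, 1, tzinfo=timezone.utc)

old_row = StoredHtml("<p>hi</p><script>x</script>",
                     datetime(2024, 1, 1, tzinfo=timezone.utc))
new_row = StoredHtml("<p>ok</p>",
                     datetime(2024, 7, 1, tzinfo=timezone.utc))

rechecked = fetch_checked(old_row, toy_sanitize, updated_at)
untouched = fetch_checked(new_row, toy_sanitize, updated_at)
```

Only rows older than the sanitizer update pay the re-sanitizing cost. If you write the re-sanitized row back to the DB with its new timestamp, each row is re-processed at most once per sanitizer update.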