Best practice for allowing Markdown in Python, while preventing XSS attacks?

Question

I need to let users enter Markdown content into my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.

I can’t be the first one with this problem, but I didn’t see any SO questions with all the keywords “python”, “Markdown”, and “XSS”, so here goes.

What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)

"Python back end"? What does this mean, exactly? If you're supporting markdown, all HTML can be quoted with <pre>. — S.Lott
You could test your app against the XSS cheat sheet at ha.ckers.org/xss.html — jfs
@S.Lott: Meaning the server-side scripting is in Python. <pre> isn’t exactly the solution. Markdown is what we use here on SO to write comments and questions… and the only time it results in a <pre> block is when you specifically request a code block (by indenting). — Alan H.
@S.Lott No worries, I’m asking this question to people who already know about how Markdown works and what a back-end is. — Alan H.
@S.Lott I decline. I want this question to be fairly general and not bound to e.g. Django, App Engine, or Zope, etc. (I assume you don’t want me to define “server-side” or “Python back end”, but rather clarify which framework I may be using. After all, if you needed those defined, surely you wouldn’t know the answer.) — Alan H.

Alan H. · Accepted Answer · 2011-03-19T00:51:25

I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:

1. Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).
2. Treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in the input will not create small text but rather the literal text “<small>…</small>”.
3. Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3, depending on the implementation. This is the approach taken here on Stack Overflow.
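
To make the difference between choices 2 and 3 concrete, here is a minimal standard-library sketch. The helper names `escape_html` and `strip_tags_naively` are hypothetical and belong to no Markdown library, and the regex stripper is deliberately naive: it is only one possible implementation, chosen to show how text like <3 can get mangled.

```python
import html
import re


def escape_html(text: str) -> str:
    """Choice 2: turn markup characters into entities so HTML shows up as literal text."""
    return html.escape(text, quote=False)


# Hypothetical, deliberately naive pattern: everything from "<" to the next ">".
_NAIVE_TAG = re.compile(r"<[^>]*>")


def strip_tags_naively(text: str) -> str:
    """Choice 3 (naive variant): delete anything that looks like a tag."""
    return _NAIVE_TAG.sub("", text)


print(escape_html("<small>fine print</small>"))
# &lt;small&gt;fine print&lt;/small&gt;

print(strip_tags_naively("I <3 Markdown, <em>really</em>"))
# I really
```

In the second call the stripper treats `<3 Markdown, <em>` as one tag and removes it, which is the kind of choking on <3 mentioned above.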

My question regards case #1, specifically.

Given that, what worked well for me is sending user input through Markdown for Python, which optionally supports Extra syntax, and then through html5lib’s sanitizer.

I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!), while benign HTML tags still worked flawlessly.

This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
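
A rough sketch of that two-step pipeline follows. It is not the exact code I ran: the function name `render_markdown_safely` is made up for illustration, and it assumes the third-party `markdown` and `html5lib` packages in a 1.x release of html5lib, where the sanitizer is exposed as `html5lib.filters.sanitizer.Filter` (marked deprecated in html5lib 1.1, and laid out differently in older releases).

```python
import html5lib
import markdown
from html5lib.filters import sanitizer
from html5lib.serializer import HTMLSerializer


def render_markdown_safely(text: str) -> str:
    """Convert Markdown to HTML, then pass the result through html5lib's sanitizer."""
    # Step 1: Markdown -> HTML. Raw HTML in the input passes through unchanged;
    # "extra" enables the PHP Markdown Extra syntax.
    unsafe_html = markdown.markdown(text, extensions=["extra"])

    # Step 2: parse the output as an HTML fragment and walk the resulting tree.
    fragment = html5lib.parseFragment(unsafe_html)
    walker = html5lib.getTreeWalker("etree")

    # The sanitizer filter keeps allowlisted markup and turns everything else
    # into plain text, which the serializer then escapes.
    clean_stream = sanitizer.Filter(walker(fragment))
    return HTMLSerializer().render(clean_stream)


print(render_markdown_safely("Hello **world**\n\n<script>alert(1)</script>"))
```

Sanitizing the rendered HTML rather than the Markdown source means the filter sees exactly the markup a browser would receive, whichever Markdown construct produced it.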

(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)
