29
votes

I need to let users enter Markdown content into my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by not allowing any HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.

I can’t be the first one with this problem, but I didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS,” so here goes.

What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)

"Python back end"? What does this mean, exactly? If you're supporting markdown, all HTML can be quoted with <pre>. - S.Lott
You could test your app against the XSS cheat sheet at ha.ckers.org/xss.html - jfs
@S.Lott: Meaning the server-side scripting is in Python. <pre> isn’t exactly the solution. Markdown is what we use here on SO to write comments and questions… and the only time it results in a <pre> block is when you specifically request a code block (by indenting). - Alan H.
@S.Lott No worries, I’m asking this question to people who already know about how Markdown works and what a back-end is. - Alan H.
@S.Lott I decline. I want this question to be fairly general and not bound to e.g. Django, App Engine, or Zope, etc. (I assume you don’t want me to define “server-side” or “Python back end”, but rather clarify which framework I may be using. After all, if you needed those defined, surely you wouldn’t know the answer.) - Alan H.

2 Answers

21
votes

I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:

  1. Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).

  2. Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.

  3. Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
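
To make the difference between options #2 and #3 concrete, here is a minimal sketch using only the Python standard library. The function names and the tag pattern are my own, purely for illustration; a real Markdown processor would do this work internally, and the regex is deliberately naive to show how stripping can misfire on text like <3.

```python
import html
import re

# Deliberately naive tag pattern (an assumption for illustration),
# to show how option #3 can misfire on ordinary text.
NAIVE_TAG = re.compile(r"<[^>]*>")


def escape_html(text: str) -> str:
    """Option #2: turn any HTML in the input into literal text."""
    return html.escape(text, quote=False)


def strip_tags_naively(text: str) -> str:
    """Option #3, done naively: delete anything that looks like a tag."""
    return NAIVE_TAG.sub("", text)


if __name__ == "__main__":
    print(escape_html("<small>fine print</small>"))
    print(strip_tags_naively("<small>fine print</small>"))
    # Everything from "<3" up to the next ">" is swallowed here.
    print(strip_tags_naively("I <3 Markdown, and 2 > 1"))
```

With escaping, the user's text survives intact but the tags stop working; with naive stripping, the last call loses most of the sentence, which is the kind of choking described in option #3.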

My question regards case #1, specifically.

Given that, what worked well for me is sending user input through

  1. Markdown for Python, which optionally supports Extra syntax, and then through
  2. html5lib’s sanitizer.

I threw a bunch of XSS attack attempts at this combination, and all of them failed (hurray!), while benign tags like <strong> worked flawlessly.

This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
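
The "allow safe tags, escape the rest" behavior can be sketched roughly as below. To be clear, this is not html5lib's sanitizer or its API: it is a stand-in built on the standard library's `html.parser`, and `ALLOWED_TAGS`, `AllowlistSanitizer`, and `sanitize` are hypothetical names. It is meant to be run on the HTML that the Markdown step produces. It drops all attributes and does not repair unbalanced tags, so treat it as an illustration of the idea and use a maintained sanitizer for anything real.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist for illustration only; a real one needs careful thought.
ALLOWED_TAGS = {"p", "em", "strong", "small", "code", "pre", "blockquote"}


class AllowlistSanitizer(HTMLParser):
    """Pass allowlisted tags through without attributes; escape all else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Attributes are dropped, so onclick= and friends cannot survive.
            self.parts.append(f"<{tag}>")
        else:
            # Unknown tag: show its source as literal text (option #2).
            self.parts.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Treat <tag/> like <tag> so it is emitted or escaped only once.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        closing = f"</{tag}>"
        self.parts.append(closing if tag in ALLOWED_TAGS else escape(closing))

    def handle_data(self, data):
        # Character references arrive decoded, so re-escape all text.
        self.parts.append(escape(data, quote=False))


def sanitize(html_text: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(sanitize("<strong>bold</strong>"))
    print(sanitize("<script>alert(1)</script>"))
    print(sanitize('<img src=x onerror="alert(1)">'))
```

In this sketch, <strong> comes out unchanged, while the script and img examples come out as escaped text that the browser displays rather than executes.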

(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)

2
votes

Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.

To prevent XSS attacks, the preferred approach is exactly the same as in other languages: escape user-supplied content when you render it back. I just took a peek at the documentation and the source code, and Markdown seems to be able to do this right out of the box with some trivial config tweaks.