3
votes

I use Markdown to give the users of my forum script a simple way to write posts.
I'm trying to sanitize all user input, but I have a problem with Markdown input.

I need to store the Markdown text in the database, not the converted HTML, because users are allowed to edit their posts.

Basically I need something like what StackOverflow does.

I read this article about the XSS vulnerability of Markdown, and the only solution I found is to run HTML Purifier on every output my script produces.

I think this could slow down my script: imagine outputting 20 posts and running HTML Purifier on each one...

So I was trying to find a way to prevent XSS by sanitizing the input instead of the output.

I can't run HTML Purifier on the input because my text is Markdown, not HTML. And if I convert it to HTML, I can't convert it back to Markdown.

I already remove (I hope) all HTML code with:

htmlspecialchars(strip_tags($text));
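To see what that call actually does to Markdown source, here is a small sketch. The helper name stripAndEscape() and the sample strings are made up for illustration, and ENT_QUOTES is passed explicitly because the default flags differ between PHP versions.

```php
<?php
// Hypothetical wrapper around the call above, so its effect can be inspected.
function stripAndEscape(string $text): string
{
    // strip_tags() removes HTML tags; htmlspecialchars() then escapes
    // the characters &, <, >, " and ' that are left over.
    return htmlspecialchars(strip_tags($text), ENT_QUOTES, 'UTF-8');
}

// Illustrative inputs (assumptions, not taken from a real post).
echo stripAndEscape("> quoted *text* with <b>bold</b>"), PHP_EOL;
echo stripAndEscape("[link](javascript:alert('xss'))"), PHP_EOL;
```

In this sketch the blockquote marker ">" comes back as "&gt;", so the escaped text no longer reads as a Markdown quote, while the javascript: URL inside the Markdown link is still there because it is not an HTML tag.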

I've thought of another solution:

When a user submits a new post: convert the input from Markdown to HTML, run HTML Purifier, and if it finds an XSS injection, simply return an error. But I don't know how to do this, nor whether HTML Purifier allows it.
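A rough sketch of that idea, under the assumption that "HTML Purifier found something" can be approximated by "the purified HTML differs from the rendered HTML". The function name isPostClean() is hypothetical, and the renderer and purifier are passed in as callables so the snippet does not depend on any particular library.

```php
<?php
/**
 * Hypothetical helper: returns true when purifying the rendered HTML
 * changes nothing, i.e. the purifier had nothing to remove.
 *
 * $render: callable turning Markdown into HTML (assumed, e.g. a Markdown() wrapper)
 * $purify: callable cleaning HTML (assumed, e.g. a wrapper around $purifier->purify())
 */
function isPostClean(string $markdown, callable $render, callable $purify): bool
{
    $html = $render($markdown);
    $purified = $purify($html);

    // Ignore leading/trailing whitespace so a trailing newline is not an "attack".
    return trim($purified) === trim($html);
}

// Possible wiring, assuming Markdown() and $purifier exist in your script:
// $ok = isPostClean(
//     $_POST['body'],
//     fn ($md) => Markdown($md),
//     fn ($html) => $purifier->purify($html)
// );
// if (!$ok) { /* reject the post and show an error */ }
```

One caveat with this approach: a purifier may also rewrite harmless markup while normalizing it, so a strict comparison could reject legitimate posts. It is only a starting point.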

I've found a lot of questions about the same problem here, but all the solutions were to store the input as HTML. I need to store it as Markdown.

Does anyone have any advice?

3
1. Remove all tags from the input with strip_tags(). 2. Sanitize input that ends up in attributes, like [link](javascript:alert('xss')). 3. Consider running htmlspecialchars() on the input before converting it into HTML and outputting it. That should be fairly safe, no? You alone control which HTML tags are used, so as long as you strip them all from the input and sanitize attributes, you have full control over XSS. - kjetilh
It doesn't prevent the bug shown in my link. - Fez Vrasta
Hm, that's pretty discouraging if it doesn't detect tags that span multiple lines... - kjetilh
Can you explain your sentence, please? I don't understand it. - Fez Vrasta
The vulnerability or bug I assume you referred to in your link says HTML tags are not properly stripped if you break them over multiple lines. - kjetilh

3 Answers

7
votes
  1. Run Markdown on the input
  2. Run HTML Purifier on the HTML generated by Markdown. Configure it to allow links, href attributes and so on (it should still strip javascript: URLs)

// the nasty stuff :)
$content = "> hello <a name=\"n\" \n href=\"javascript:alert('xss')\">*you*</a>";

require '/path/to/markdown.php';

// at this point, the generated HTML is vulnerable to XSS
$content = Markdown($content);

require '/path/to/HTMLPurifier/HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('Cache.DefinitionImpl', null);

// put here every tag and attribute that you want to pass through
$config->set('HTML.Allowed', 'a[href|title],blockquote[cite]');

$purifier = new HTMLPurifier($config);

// here, the javascript command is stripped off
$content = $purifier->purify($content);

print $content;
1
votes

Solved...

$text = "> hello <a name=\"n\"
> href=\"javascript:alert('xss')\">*you*</a>";


$text = strip_tags($text);

$text = Markdown($text);

echo $text;

It returns:

<blockquote>
  <p>hello  href="javascript:alert('xss')"&gt;<em>you</em></p>
</blockquote>

And not:

<blockquote>
  <p>hello <a name="n" href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>

So it seems that strip_tags() does work.

Combined with:

$text = preg_replace('/href=(\"|)javascript:/', "", $text);

the entire input should be safe from XSS injections. Correct me if I'm wrong.

0
votes

The HTML output of your Markdown depends only on the Markdown parser, so you can

  1. convert your Markdown to HTML and sanitize the HTML afterwards, as described here:

    Escape from XSS vulnerability maintaining Markdown syntax?

  2. or modify your Markdown parser to check every parameter that goes into an HTML attribute for signs of XSS. Of course, you should escape HTML tags before parsing. I think this solution is much faster than the other, because for simple text you usually only need to check the URLs of images and links.
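For the second option, the URL check could look roughly like this. It is only a sketch: isSafeUrl() is a hypothetical helper, not part of any Markdown library, and the allowlist of schemes is an assumption you would adjust to your forum.

```php
<?php
// Hypothetical helper: returns true when a link or image URL is relative
// or uses one of the allowlisted schemes.
function isSafeUrl(string $url, array $allowedSchemes = ['http', 'https', 'mailto']): bool
{
    // Decode entities such as &#106; so an encoded scheme cannot hide.
    $decoded = html_entity_decode($url, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Drop whitespace and control characters, which could be used to
    // disguise the scheme (e.g. "java\tscript:").
    $normalized = (string) preg_replace('/[\x00-\x20]+/', '', $decoded);

    // No "scheme:" prefix means a relative URL, which is accepted here.
    if (!preg_match('/^([a-z][a-z0-9+.\-]*):/i', $normalized, $m)) {
        return true;
    }

    return in_array(strtolower($m[1]), $allowedSchemes, true);
}

var_dump(isSafeUrl('https://example.com/'));      // bool(true)
var_dump(isSafeUrl("javascript:alert('xss')"));   // bool(false)
```

The parser would call it wherever it is about to write an href or src attribute and drop or neutralize the link when it returns false. An allowlist is used rather than a blocklist of "javascript:" so that other schemes such as data: are rejected too.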