How do I fix this multiline regular expression in Ruby?

Question

I have a regular expression in Ruby that isn't working properly in multiline mode.

I'm trying to convert Markdown text into the Textile-eque markup used in Redmine. The problem is in my regular expression for converting code blocks. It should find any lines leading with 4 spaces or a tab, then wrap them in pre tags.

markdownText = '# header

some text that precedes code

    var foo = 9;
    var fn = function() {}

    fn();

some post text'

puts markdownText.gsub!(/(^(?:\s{4}|\t).*?$)+/m,"<pre>\n\\1\n</pre>")

Intended result:

# header

some text that precedes code

<pre>
    var foo = 9;
    var fn = function() {}

    fn();
</pre>

some post text

The problem is that the closing pre tag is printed at the end of the document instead of after "fn();". I tried some variations of the following expression but it doesn't match:

gsub!(/(^(?:\s{4}|\t).*?$)+^(\S)/m, "<pre>\n\\1\n</pre>\\2")

How do I get the regular expression to match just the indented code block? You can test this regular expression on Rubular here.

Why not include newline in your regex: ((?:\s{4}|\t).*?\n)+ — Mladen Jablanović
possible duplicate of RegEx match open tags except XHTML self-contained tags — Andrew Grimm
@Mladen Jablanovic I couldn't get your example to work using this code: puts markdownText.gsub!(/((?:\s{4}|\t).*?\n)+/,"<pre>\n\\1\n</pre>"). How would \n behave differently from $? — DonovanChan
That was just the regex to get the indented part (tried it in Rubular), not a complete working gsub-ready solution (hence, just a comment). — Mladen Jablanović

ridgerunner ridgerunner · Accepted Answer · 2011-04-19T16:50:16

First, note that 'm' multi-line mode in Ruby is equivalent to 's' single-line mode of other languages. In other words; 'm' mode in Ruby means: "dot matches all".

This regex will do a pretty good job of matching a markdown-like code section:

re = / # Match a MARKDOWN CODE section.
    (\r?\n)              # $1: CODE must be preceded by blank line
    (                    # $2: CODE contents
      (?:                # Group for multiple lines of code.
        (?:\r?\n)+       # Each line preceded by a newline,
        (?:[ ]{4}|\t).*  # and begins with four spaces or tab.
      )+                 # One or more CODE lines
      \r?\n              # CODE folowed by blank line.
    )                    # End $2: CODE contents
    (?=\r?\n)            # CODE folowed by blank line.
    /x
result = subject.gsub(re, '\1<pre>\2</pre>')

This requires a blank line before and after the code section and allows blank lines within the code section itself. It allows for either \r\n or \n line terminations. Note that this does not strip the leading 4 spaces (or tab) before each line. Doing that will require more code complexity. (I am not a ruby guy so can't help out with that.)

I would recommend looking at the markdown source itself to see how its really being done.

How do I fix this multiline regular expression in Ruby?

4 Answers