9
votes

I am having problems with nesting positive/negative lookahead and lookbehind in regex.

Let's say that I want to replace each '*' in a string with '%', and that '\' escapes the next character. (I am turning a regex-style pattern into an SQL LIKE pattern ^^).

So the string

  • '*test*' should be changed to '%test%',
  • '\\*test\\*' -> '\\%test\\%', but
  • '\*test\*' and '\\\*test\\\*' should stay the same.

I tried:

(?<!\\)(?=\\\\)*\*      but this doesn't work
(?<!\\)((?=\\\\)*\*)    ...
(?<!\\(?=\\\\)*)\*      ...
(?=(?<!\\)(?=\\\\)*)\*  ...

What is the correct regex that will match the '*'s in examples given above?

What is the difference between (?<!\\(?=\\\\)*)\* and (?=(?<!\\)(?=\\\\)*)\*? Or, if both are fundamentally wrong, what is the difference between regexes built with these two shapes?

5
What language do you use? And do you really expect that \*test\* stays the same and is not turned into *test*? – Gumbo

5 Answers

11
votes

To find an unescaped character, you would look for a character that is preceded by an even number of (or zero) escape characters. This is relatively straightforward.

(?<=(?<!\\)(?:\\\\)*)\*        # this is explained in Tim Pietzcker's answer

Unfortunately, many regex engines do not support variable-length look-behind, so we have to substitute with look-ahead:

(?=(?<!\\)(?:\\\\)*\*)(\\*)\*  # also look at ridgerunner's improved version

Replace this with the contents of group 1 and a % sign.

Explanation

(?=           # start look-ahead
  (?<!\\)     #   a position not preceded by a backslash (via look-behind)
  (?:\\\\)*   #   an even number of backslashes (don't capture them)
  \*          #   a star
)             # end look-ahead. If found,
(             # start group 1
  \\*         #   match any number of backslashes in front of the star
)             # end group 1
\*            # match the star itself

The look-ahead makes sure only even numbers of backslashes are taken into account. Anyway, there is no way around matching them into a group, since the look-ahead does not advance the position in the string.
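For anyone who wants to try the look-ahead variant, here is a minimal JavaScript sketch. It assumes an engine with ES2018 lookbehind support (the pattern still contains one fixed-length look-behind); the function name is only a placeholder for illustration.

```javascript
// The look-ahead checks that an even-length backslash run leads to a star,
// group 1 then re-consumes that run, and the replacement puts it back.
const unescapedStar = /(?=(?<!\\)(?:\\\\)*\*)(\\*)\*/g;

// Hypothetical helper name, not from the answer above.
function replaceViaLookahead(text) {
  return text.replace(unescapedStar, "$1%");
}

// String.raw keeps the backslashes exactly as typed.
const samples = [
  String.raw`*test*`,
  String.raw`\\*test\\*`,
  String.raw`\*test\*`,
  String.raw`\\\*test\\\*`,
];
for (const s of samples) {
  console.log(s, "->", replaceViaLookahead(s));
}
```

The first two samples should come out with % in place of the stars, and the last two should be printed unchanged.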

9
votes

Ok, since Tim decided to not update his regex with my suggested mods (and Tomalak's answer is not as streamlined), here is my recommended solution:

Replace: ((?<!\\)(?:\\\\)*)\* with $1%

Here it is in the form of a commented PHP snippet:

// Replace all non-escaped asterisks with "%".
$re = '%             # Match non-escaped asterisks.
    (                # $1: Any/all preceding escaped backslashes.
      (?<!\\\\)      # At a position not preceded by a backslash,
      (?:\\\\\\\\)*  # Match zero or more escaped backslashes.
    )                # End $1: Any preceding escaped backslashes.
    \*               # Unescaped literal asterisk.
    %x';
$text = preg_replace($re, '$1%', $text);

Addendum: Non-lookaround JavaScript Solution

The above solution does require lookbehind, so it will not work in JavaScript. The following JavaScript solution does not use lookbehind:

text = text.replace(/(\\[\S\s])|\*/g,
    function(m0, m1) {
        return m1 ? m1 : '%';
    });

This solution replaces each instance of backslash-anything with itself, and each instance of * asterisk with a % percent sign.
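As a quick check of that description, the same callback can be wrapped in a function and run against the inputs from the question. This is only a sketch; the function name and the sample list are assumptions made for illustration.

```javascript
// Hypothetical wrapper around the callback approach.
function replaceViaCallback(text) {
  // Alternative 1 consumes a backslash plus any one character as a unit,
  // so an escaped "*" is used up before alternative 2 can see it.
  return text.replace(/(\\[\S\s])|\*/g, (match, escapedPair) =>
    escapedPair ? escapedPair : "%"
  );
}

const samples = [
  String.raw`*test*`,
  String.raw`\\*test\\*`,
  String.raw`\*test\*`,
  String.raw`\\\*test\\\*`,
  "**text**",
];
for (const s of samples) {
  console.log(s, "->", replaceViaCallback(s));
}
```

Because every character is consumed left to right exactly once, no lookaround is needed to know whether a given star is escaped.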

Edit 2011-10-24: Fixed JavaScript version to correctly handle cases such as: **text**. (Thanks to Alan Moore for pointing out the error in previous version.)

5
votes

Others have shown how this can be done with a lookbehind, but I'd like to make a case for not using lookarounds at all. Consider this solution:

s/\G([^*\\]*(?:\\.[^*\\]*)*)\*/$1%/g;

The bulk of the regex, [^*\\]*(?:\\.[^*\\]*)*, is an example of Friedl's "unrolled loop" idiom. It consumes as many as it can of individual characters other than asterisk or backslash, or pairs of characters consisting of a backslash followed by anything. That allows it to avoid consuming unescaped asterisks, no matter how many escaped backslashes (or other characters) precede them.

The \G anchors each match to the position where the previous match ended, or to the beginning of the input if this is the first match attempt. This prevents the regex engine from simply skipping over escaped backslashes and matching the unescaped asterisks anyway. So, each iteration of the /g controlled match consumes everything up to the next unescaped asterisk, capturing all but the asterisk in group #1. Then that's plugged back in and the * is replaced with %.

I think this is at least as readable as the lookaround approaches, and easier to understand. It does require support for \G, so it won't work in JavaScript or Python, but it works just fine in Perl.
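For readers on an engine without \G, a rough approximation is possible where the sticky flag is available. The sketch below assumes an ES2015+ JavaScript engine: combining the g and y flags makes every match start exactly where the previous one ended, which is a swapped-in stand-in for \G rather than the Perl code above. The function name is hypothetical, and [\s\S] replaces the dot so that an escaped newline also counts as a pair.

```javascript
// Unrolled loop: plain characters, then (escaped pair + plain characters)*,
// captured in group 1, followed by the unescaped star to replace.
const upToStar = /([^*\\]*(?:\\[\s\S][^*\\]*)*)\*/gy;

// Hypothetical helper name. With "gy", replace() stops at the first position
// where no match starts, leaving the rest of the string untouched.
function replaceViaSticky(text) {
  return text.replace(upToStar, "$1%");
}

console.log(replaceViaSticky(String.raw`\\*test\\*`));
console.log(replaceViaSticky(String.raw`\*test\*`));
console.log(replaceViaSticky(String.raw`a*b\*c*`));
```

The third call is the interesting one: the escaped star in the middle is swallowed as part of group 1, so only the first and last stars are replaced.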

4
votes

So you essentially want to match * only if it's preceded by an even number of backslashes (or, in other words, if it isn't escaped)? Then you don't need lookahead at all since you're only looking back, aren't you?

Search for

(?<=(?<!\\)(?:\\\\)*)\*

and replace with %.
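For anyone who wants to try it, here is a minimal JavaScript sketch of this search-and-replace. It assumes an engine with ES2018 lookbehind support, which as far as I know does not restrict the length of a lookbehind; the function name is a placeholder.

```javascript
// A star preceded by an even number (possibly zero) of backslashes.
const unescapedStar = /(?<=(?<!\\)(?:\\\\)*)\*/g;

// Hypothetical helper name. Only the star itself is matched, so the
// replacement is a bare "%" with no group to restore.
function replaceViaLookbehind(text) {
  return text.replace(unescapedStar, "%");
}

const samples = [
  String.raw`*test*`,
  String.raw`\\*test\\*`,
  String.raw`\*test\*`,
  String.raw`\\\*test\\\*`,
];
for (const s of samples) {
  console.log(s, "->", replaceViaLookbehind(s));
}
```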

Explanation:

(?<=       # Assert that it's possible to match before the current position...
 (?<!\\)   # (unless there are more backslashes before that)
 (?:\\\\)* # an even number of backslashes
)          # End of lookbehind
\*         # Then match an asterisk

0
votes

The problem of detecting escaped backslashes in regex has fascinated me for a while, and it wasn't until recently that I realized I was completely overcomplicating it. There are a couple of things that make it simpler, and as far as I can tell nobody here has noticed them yet:

  • Backslashes escape any character after them, not just other backslashes. So (\\.)* will eat an entire chain of escaped characters, whether they're backslashes or not. You don't have to worry about even- or odd-numbered slashes; just check for a solitary \ at the beginning or end of the chain (ridgerunner's JavaScript solution does take advantage of this).

  • Lookarounds aren't the only way to make sure you start with the first backslash in a chain. You can just look for a non-backslash character (or the start of the string).

The result is a short, simple pattern that doesn't need lookarounds or callbacks, and it's shorter than anything else I see so far.

/(?<!\\)((?:\\.)*)\*/g

And the replacement string:

"$1%"

This works in .NET, which allows lookbehinds, and it should work for you in Perl. It's possible to do it in JavaScript, but without lookbehinds or the \G anchor, I can't see a way to do it in a one-liner. Ridgerunner's callback should work, as will a loop:

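// $1: the character before the chain (or the start of the string).
// $2: the whole chain of escaped pairs, wrapped in one group so none are lost.
var regx = /(^|[^\\])((?:\\.)*)\*/g;
// Loop because each match consumes the preceding character, so adjacent
// stars such as ** need more than one pass.
while (input.match(regx)) {
    input = input.replace(regx, '$1$2%');
}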
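One detail worth checking when a chain like this has to be restored in the replacement: a quantified capturing group keeps only its last repetition, so the whole chain needs an outer group around a non-capturing repetition. The sketch below compares the two shapes on an input with two escaped backslashes before the star. It assumes an engine with ES2018 lookbehind support, and the variable names are made up for illustration.

```javascript
// Four backslashes (two escaped pairs) followed by an unescaped star.
const input = String.raw`\\\\*`;

// Quantified group: $1 holds only the final repetition, so one pair is lost.
const lastOnly = input.replace(/(?<!\\)(\\.)*\*/g, "$1%");

// Outer group around a non-capturing repetition: $1 holds the whole chain.
const wholeChain = input.replace(/(?<!\\)((?:\\.)*)\*/g, "$1%");

console.log(lastOnly);   // two backslashes, then %
console.log(wholeChain); // four backslashes, then %
```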

There are a lot of names here I recognize from other regex questions, and I know some of you are smarter than me. If I've made a mistake, please say so.