Is there a regular expression to detect a valid regular expression?

Question

Is it possible to detect a valid regular expression with another regular expression? If so, please give example code below.

So your problem is validating a regex, you chose a regex for solving it. I wonder if the problem-number-increasing property of regexes is additive or multiplicative. It feels like 4 problems instead of 2 :) — abesto
There are many notations for regular expressions: some features and their spellings are common to most, while others are spelled differently or are only available in one particular notation. Most of those notations aren't "regular" in the regular-grammar sense (you'd need a context-free parser to handle the unbounded nesting of subexpressions), though many modern "regular expression" notations have extensions that go beyond the original formal definition and might allow their own notations to be recognized. In any case, why not simply ask your regex library whether each regex is valid? — Steve314
@bevacqua I need to validate a regexp in an XML schema. How can I do it without another regexp? — zenden2k
Actually compile or run the regex (pattern) to be checked under whatever exception-handling mechanism your language has, so that the language's regex engine/compiler itself checks it. (This assumes the basic syntax is correct so that the program runs, but that can be included in the check by using your language's facilities to evaluate the string for the regex as (possibly syntactically wrong) code, or similar.) — zdim
This is the perfect answer for Python users: stackoverflow.com/questions/19630994/… — gianni
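
A minimal sketch of the compile-and-catch approach suggested in the comments above, assuming Python and its standard `re` module. The helper name `is_valid_regex` is made up for illustration, and the check only tells you whether the pattern is valid in Python's flavor, not in PCRE, .NET, or XML Schema:

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if Python's `re` module can compile `pattern`."""
    try:
        re.compile(pattern)
    except re.error:
        # The engine itself rejected the pattern (unbalanced group,
        # dangling quantifier, bad repeat range, and so on).
        return False
    return True


if __name__ == "__main__":
    for candidate in [r"(a|b)+\d{2,3}", "(unclosed", "*a", "a{2,1}"]:
        print(repr(candidate), is_valid_regex(candidate))
```

The same idea carries over to other languages: let the engine that will eventually run the pattern be the judge, and treat its compile error as "invalid".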

Markus Jarderot · Accepted Answer · 2008-10-05T17:15:56

/
^                                             # start of string
(                                             # first group start
  (?:
    (?:[^?+*{}()[\]\\|]+                      # literals and ^, $
     | \\.                                    # escaped characters
     | \[ (?: \^?\\. | \^[^\\] | [^\\^] )     # character classes
          (?: [^\]\\]+ | \\. )* \]
     | \( (?:\?[:=!]|\?<[=!]|\?>)? (?1)?? \)  # parenthesis, with recursive content
     | \(\? (?:R|[+-]?\d+) \)                 # recursive matching
     )
    (?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )?   # quantifiers
  | \|                                        # alternative
  )*                                          # repeat content
)                                             # end first group
$                                             # end of string
/

This is a recursive regex, which many regex engines do not support. PCRE-based engines should support it.

Without whitespace and comments:

/^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/
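
To illustrate the point about engine support, here is a small sketch, assuming Python's standard `re` module, which has no `(?1)` or `(?R)` recursion. The helper name `compiles_with_stdlib_re` is hypothetical. The pattern is the compact form above without the surrounding `/` delimiters:

```python
import re

# Compact recursive pattern from the answer, minus the / delimiters.
RECURSIVE_PATTERN = r"^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$"


def compiles_with_stdlib_re(pattern: str) -> bool:
    """Return True if the standard-library `re` engine accepts `pattern`."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


if __name__ == "__main__":
    # The stdlib engine stops at the (?1) recursion construct.
    print(compiles_with_stdlib_re(RECURSIVE_PATTERN))
    # An ordinary pattern without recursion is accepted.
    print(compiles_with_stdlib_re(r"(a|b)*"))
```

In Python, the third-party `regex` package does support recursive patterns, but it is a separate install and is not used in this sketch.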

.NET does not support recursion directly (the (?1) and (?R) constructs). The recursion would have to be converted to counting balanced groups:

^                                         # start of string
(?:
  (?: [^?+*{}()[\]\\|]+                   # literals and ^, $
   | \\.                                  # escaped characters
   | \[ (?: \^?\\. | \^[^\\] | [^\\^] )   # character classes
        (?: [^\]\\]+ | \\. )* \]
   | \( (?:\?[:=!]
         | \?<[=!]
         | \?>
         | \?<[^\W\d]\w*>
         | \?'[^\W\d]\w*'
         )?                               # opening of group
     (?<N>)                               #   increment counter
   | \)                                   # closing of group
     (?<-N>)                              #   decrement counter
   )
  (?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )? # quantifiers
| \|                                      # alternative
)*                                        # repeat content
$                                         # end of string
(?(N)(?!))                                # fail if counter is non-zero.

Compacted:

^(?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>|\?<[^\W\d]\w*>|\?'[^\W\d]\w*')?(?<N>)|\)(?<-N>))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*$(?(N)(?!))
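
The counter that (?<N>), (?<-N>) and (?(N)(?!)) implement may be easier to see outside regex syntax. Below is a rough sketch in Python (not .NET) of the counting idea only. The function name `groups_balanced` is made up for illustration. It skips escapes and character classes the way the pattern does, but it checks group balance and nothing else, so it is not a full validator:

```python
def groups_balanced(pattern: str) -> bool:
    """Check that every '(' has a matching ')' outside escapes and classes."""
    depth = 0  # plays the role of the N capture stack
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            if i + 1 >= n:
                return False  # dangling backslash
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            i += 1
            if i < n and pattern[i] == "^":
                i += 1
            # The first class member is taken as-is, even if it is ']'.
            i += 2 if i < n and pattern[i] == "\\" else 1
            while i < n and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            if i >= n:
                return False  # unterminated character class
            i += 1  # consume the closing ']'
            continue
        if ch == "(":
            depth += 1  # like (?<N>): push
        elif ch == ")":
            depth -= 1  # like (?<-N>): pop, failing if nothing to pop
            if depth < 0:
                return False
        i += 1
    return depth == 0  # like (?(N)(?!)): fail if anything is left


if __name__ == "__main__":
    for candidate in ["(a(b)c)", "(a", "a)", r"\(", "[(]"]:
        print(repr(candidate), groups_balanced(candidate))
```

The three commented branches correspond to the increment, decrement, and final "counter is non-zero" check in the .NET pattern.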

From the comments:

Will this validate substitutions and translations?

It will validate just the regex part of substitutions and translations. s/<this part>/.../

It is not theoretically possible to match all valid regex grammars with a regex.

It is possible if the regex engine supports recursion, as PCRE does, but such patterns can't really be called regular expressions any more.

Indeed, a "recursive regular expression" is not a regular expression. But this is an often-accepted extension to regex engines... Ironically, this extended regex doesn't match extended regexes.

"In theory, theory and practice are the same. In practice, they're not." Almost everyone who knows regular expressions knows that regular expressions do not support recursion. But PCRE and most other implementations support much more than basic regular expressions.

Using this in a shell script with the grep command shows me an error: grep: Invalid content of {}. I am making a script that greps a code base to find all the files that contain regular expressions.

This pattern exploits an extension called recursive regular expressions, which the POSIX flavor of regex does not support. You could try the -P switch to enable the PCRE regex flavor.

Regex itself "is not a regular language and hence cannot be parsed by regular expression..."

This is true for classical regular expressions. Some modern implementations allow recursion, which lets them describe context-free languages, although it is somewhat verbose for this task.

I see where you're matching []()/\. and other special regex characters. Where are you allowing non-special characters? It seems like this will match ^(?:[\.]+)$, but not ^abcdefg$. That's a valid regex.

[^?+*{}()[\]\\|] will match any single character that is not part of any of the other constructs. This includes both literals (a-z) and certain special characters (^, $, .).
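
A quick way to check this claim is to try the character class on its own. The sketch below assumes Python's standard `re` module, which accepts this particular class unchanged. The names `LITERAL_RUN` and `is_literal_run` are made up for illustration:

```python
import re

# The "literals and ^, $" branch of the answer's pattern, taken in isolation.
LITERAL_RUN = re.compile(r"[^?+*{}()[\]\\|]+")


def is_literal_run(text: str) -> bool:
    """Return True if `text` consists only of characters the class accepts."""
    return LITERAL_RUN.fullmatch(text) is not None


if __name__ == "__main__":
    # Plain letters plus ^, $ and . are accepted by the class.
    print(is_literal_run("^abcdefg$"))
    print(is_literal_run("a.b"))
    # Quantifiers, parentheses and backslashes are left to other branches.
    print(is_literal_run("a*"))
    print(is_literal_run("(a)"))
```

So ^abcdefg$ is consumed by this one branch, while anything containing ?, +, *, braces, parentheses, brackets, a backslash, or | has to be matched by the other alternatives.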
