Trouble with regular expression matching in lexer

Question

I am in the process of making a templating engine that is quite complex as it will feature typical constructs in programming languages such as if statements and loops.

Currently, I am working on the lexer, which, I believe, deals with the job of converting a stream of characters into tokens. What I want to do is capture certain structures within the HTML document, which can later be worked on by the parser.

This is an example of the syntax:

<head>

    <title>Template</title>
    <meta charset="utf-8">

</head>

<body>

    <h1>{{title}}</h1>

    <p>This is also being matched.</p>

    {{#myName}}
        <p>My name is {{myName}}</p>
    {{/}}

    <p>This content too.</p>

    {{^myName}}
        <p>I have no name.</p>
    {{/}}

    <p>No matching here...</p>

</body>

I am trying to scan only for everything between the starting '{{' characters and the ending '}}' characters. So {{title}} would be one match, and {{#myName}}, together with the text and content leading up to {{/}}, would be the second match.

I am not particularly the best at regular expressions, and I am pretty sure it is an issue with the pattern I have devised, which is this:

({{([#\^]?)([a-zA-Z0-9\._]+)}}([\w\W]+){{\/?}})

I read this as: match two { characters, then optionally either # or ^, then a name made up of uppercase or lowercase letters, digits, dots, or underscores, followed by }}. Then match anything that comes after the closing }} characters until {{/}} is met, with the / being optional.

The problem is visible in the link below: it is matching text that is not within the {{ and }} blocks. I am wondering whether it is linked to the use of \w and \W, because if I spell out exactly which characters I want to match in the set, it then seems to work.

The regular expression test is here. I did look at the regular expression in the shared list for capturing all text that isn't HTML, and I noticed it uses lookaheads, which I just cannot grasp, nor do I understand why they would help me.

Can someone help me by pointing out the problem with the regular expression, or by telling me whether I am going about creating the lexer the wrong way?

I hope I've provided enough information, and thank you for any help!

You are matching way too much. You should simply be converting the individual {{things}} to tokens. For one thing, your example contains nested {{things}} -- surely that cannot be a single token (and surely a regex will not suffice to capture that sort of structure). In fact, anything with structure should happen in the grammar, not the lexer. — tripleee
@tripleee I intend on doing that, but I wanted to capture them all first, then break them down further, as I'm not interested in the other content. — Mark
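The approach suggested in the comment above, turning each individual {{thing}} into its own token and leaving nesting to the parser, could be sketched roughly as follows. This is only a minimal illustration: the function name lexTemplate and the token type names (TEXT, VARIABLE, SECTION_OPEN, INVERTED_OPEN, SECTION_CLOSE) are made up for this example and do not come from the question.

```php
<?php
// Hypothetical flat lexer: splits a template into TEXT and tag tokens.
// It does not try to pair opening and closing tags; that is left to the parser.
function lexTemplate(string $source): array
{
    // A tag is either {{name}}, {{#name}}, {{^name}} or the closing {{/}}.
    $parts = preg_split(
        '~({{(?:[#^]?[\w.]+|/)}})~',
        $source,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('~\A{{(?:([#^]?)([\w.]+)|/)}}\z~', $part, $m)) {
            if (!isset($m[2])) {
                // Only {{/}} matches without filling the name group.
                $tokens[] = ['type' => 'SECTION_CLOSE'];
            } elseif ($m[1] === '#') {
                $tokens[] = ['type' => 'SECTION_OPEN', 'name' => $m[2]];
            } elseif ($m[1] === '^') {
                $tokens[] = ['type' => 'INVERTED_OPEN', 'name' => $m[2]];
            } else {
                $tokens[] = ['type' => 'VARIABLE', 'name' => $m[2]];
            }
        } else {
            // Everything between tags is kept as plain text.
            $tokens[] = ['type' => 'TEXT', 'value' => $part];
        }
    }
    return $tokens;
}

// Made-up sample input, much shorter than the template in the question.
$tokens = lexTemplate('<h1>{{title}}</h1>{{#a}}x{{/}}');
```

With a token list like this, deciding which {{/}} closes which section becomes a job for the parser (for example with a stack), rather than for a single regular expression.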

Casimir et Hippolyte · Accepted Answer · 2013-12-15T18:57:35

Your pattern doesn't work because [\w\W]+ takes all possible characters up to the last {{/}} in your string. Quantifiers (i.e. +, *, {1,3}, ?) are greedy by default. To make a quantifier lazy, add a ? after it: [\w\W]+?
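As a small illustration of the difference, here is a sketch that runs the pattern from the question with and without the lazy ?. The sample string in $html is made up for this example and is much shorter than the real template.

```php
<?php
// Made-up sample: two sections with some unrelated HTML between them.
$html = '{{#a}}x{{/}} <p>text</p> {{^b}}y{{/}}';

// The pattern from the question (greedy), and the same pattern with +? (lazy).
$greedy = '~{{([#^]?)([a-zA-Z0-9._]+)}}([\w\W]+){{/?}}~';
$lazy   = '~{{([#^]?)([a-zA-Z0-9._]+)}}([\w\W]+?){{/?}}~';

preg_match_all($greedy, $html, $g);
preg_match_all($lazy, $html, $l);

// Greedy: a single match running up to the last {{/}} in the string.
var_dump($g[0]);
// Lazy: one match per section, each ending at the first {{/}} it reaches.
var_dump($l[0]);
```

On this sample the greedy version returns one match covering the whole string, including the unrelated paragraph, while the lazy version returns {{#a}}x{{/}} and {{^b}}y{{/}} separately. Stopping at the first {{/}} is still not enough for nested sections, which is what the patterns below address.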

A pattern to deal with nested structures:

$pattern = <<<'LOD'
~
{{
(?|                  # branch reset group: the point of this feature is that
                     # capturing group numbers are the same in all alternatives
    ([\w.]++)}}      # self-closing tag: capturing group 1: tag name
  |                  # OR
    ([#^][\w.]++)}}  # opening tag:      capturing group 1: tag name
    (                # capturing group 2: content
        (?>          # atomic group: three possible content types
            [^{]++   # all characters except { 
          |          # OR
            {(?!{)   # { not followed by another {
          |          # OR
            (?R)     # another tag is met, attempt the whole pattern again
        )*           # repeat the atomic group 0 or more times
    )                # close the second capturing group
    {{/}}            # closing tag
)                    # close the branch reset group
~x
LOD;

preg_match_all($pattern, $html, $matches);

var_dump($matches);

To obtain all nested levels you can use this pattern:

$pattern = <<<'LOD'
~
(?=(                            # open a lookahead and the 1st capturing group
    {{
    (?|
        ([\w.]++)}}
      |
        ([#^][\w.]++)}}
        (                       # ?R was changed to ?1 because I don't want to
        (?>[^{]++|{(?!{)|(?1))* # repeat the whole pattern but only the
        )                       # subpattern in the first capturing group
        {{/}}
    )
)                               # close the 1st capturing group 
)                               # and the lookahead
~x
LOD;

preg_match_all($pattern, $html, $matches);

var_dump($matches);

This pattern is simply the first pattern wrapped in a lookahead and a capturing group. Because the lookahead itself consumes no characters, this construct makes it possible to capture overlapping substrings.
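The "capturing group inside a lookahead" idea can be seen in isolation on a much simpler case. The sketch below is unrelated to templates; the digit string is just a made-up input chosen to make the overlap obvious.

```php
<?php
// Without the lookahead, matches cannot overlap: each match consumes its characters.
preg_match_all('~\d\d~', '1234', $plain);

// With the capturing group inside a lookahead, the overall match is empty,
// so the engine retries at every position and group 1 can overlap.
preg_match_all('~(?=(\d\d))~', '1234', $overlap);

var_dump($plain[0]);   // pairs that do not overlap
var_dump($overlap[1]); // every pair of adjacent digits
```

Here the plain pattern yields 12 and 34, while the lookahead version yields 12, 23 and 34 in group 1. In the same way, the template pattern above can report an inner section even though it lies entirely inside an outer section that was already captured.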

More information about the regex features used in these two patterns:

possessive quantifiers ++

atomic groups (?>..)

lookahead (?=..), (?!..)

branch reset group (?|..|..)

recursion (?R), (?1)
