I am building a templating engine that is fairly complex, as it will support constructs typical of programming languages, such as if statements and loops.
Currently, I am working on the lexer, which, I believe, is responsible for converting a stream of characters into tokens. What I want to do is capture certain structures within the HTML document, which can later be processed by the parser.
This is an example of the syntax:
<head>
<title>Template</title>
<meta charset="utf-8">
</head>
<body>
<h1>{{title}}</h1>
<p>This is also being matched.</p>
{{#myName}}
<p>My name is {{myName}}</p>
{{/}}
<p>This content too.</p>
{{^myName}}
<p>I have no name.</p>
{{/}}
<p>No matching here...</p>
</body>
I am trying to scan only for what lies between the opening '{{' characters and the closing '}}' characters. So {{title}} would be one match, and {{#myName}}, together with the text and content leading up to {{/}}, would be the second match.
I am not particularly good at regular expressions, and I am fairly sure the issue lies with the pattern I have devised, which is this:
({{([#\^]?)([a-zA-Z0-9\._]+)}}([\w\W]+){{\/?}})
I read this as: match two { characters, then optionally either # or ^, then a name made up of uppercase or lowercase letters, digits, dots, or underscores, followed by the closing }} characters. Then match anything that comes after, until the {{/}} characters are met, where the / is optional.
The problem is visible in the link below: it is matching text that is not within the {{ and }} blocks. I am wondering whether it is linked to the use of \w and \W, because if I spell out exactly which characters I want to match in the set, it then seems to work.
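To make the over-matching concrete, here is a minimal sketch in Python (the language is an assumption, since the question does not name one; the braces are escaped for Python's `re` module). It runs the pattern above against a shortened copy of the template. `[\w\W]` simply means "any character", so it is not the cause by itself. The over-matching comes from two things: the greedy `+` runs to the end of the input and backtracks only as far as the last `{{/}}`, and the optional `[#\^]?` lets a plain `{{title}}` open a block. `REVISED` is a hypothetical alternative, not a definitive fix: it requires `#` or `^` for a block, uses a lazy `*?`, and matches plain variables through a separate alternative. It still cannot handle nested sections.

```python
import re

TEMPLATE = """<body>
<h1>{{title}}</h1>
<p>This is also being matched.</p>
{{#myName}}
<p>My name is {{myName}}</p>
{{/}}
<p>This content too.</p>
{{^myName}}
<p>I have no name.</p>
{{/}}
<p>No matching here...</p>
</body>"""

# The pattern from the question, with the braces escaped.
ORIGINAL = re.compile(
    r"(\{\{([#\^]?)([a-zA-Z0-9\._]+)\}\}([\w\W]+)\{\{\/?\}\})"
)

# Hypothetical revision (an assumption, not the asker's pattern):
# a section must start with # or ^ and ends at the nearest {{/}};
# anything else of the form {{name}} is a standalone variable.
REVISED = re.compile(
    r"\{\{([#^])([a-zA-Z0-9._]+)\}\}([\w\W]*?)\{\{/\}\}"
    r"|\{\{([a-zA-Z0-9._]+)\}\}"
)

original_matches = [m.group(0) for m in ORIGINAL.finditer(TEMPLATE)]
revised_matches = [m.group(0) for m in REVISED.finditer(TEMPLATE)]

# One match running from {{title}} to the last {{/}}.
print(len(original_matches))

# Three matches: {{title}}, the # section, and the ^ section.
for text in revised_matches:
    print(repr(text))
```

Note that the inner `{{myName}}` ends up inside the section's captured content rather than as its own match, so the content would need to be scanned again.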
The regular expression test is here. I did look at the regular expression in the shared list for capturing all text that isn't HTML, and I noticed it uses lookaheads, which I just cannot grasp, nor do I understand why they would help me.
Can someone point out the problem with the regular expression, or tell me whether I am going about creating the lexer the wrong way?
I hope I've provided enough information, and thank you for any help!
{{things}} to tokens. For one thing, your example contains nested {{things}} -- surely that cannot be a single token (and surely a regex will not suffice to capture that sort of structure). In fact, anything with structure should happen in the grammar, not the lexer. – tripleee
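As a rough illustration of the comment's point, here is a minimal Python sketch of a flat lexer that emits one token per tag and leaves the pairing of opening and closing tags to the parser. The `Token` type, the token kind names, and the `tokenize` function are all hypothetical names chosen for this example, and the tag syntax is assumed to be exactly what the question shows.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "TEXT", "VAR", "OPEN", "INVERTED", or "CLOSE"
    value: str  # literal text, or the tag name (empty for "{{/}}")


# A single tag: either the bare closer "/" or an optional sigil plus a name.
TAG = re.compile(r"\{\{(/|[#^]?[a-zA-Z0-9._]+)\}\}")


def tokenize(source: str) -> Iterator[Token]:
    """Split the template into text and tag tokens, with no nesting logic."""
    pos = 0
    for match in TAG.finditer(source):
        # Everything between the previous tag and this one is plain text.
        if match.start() > pos:
            yield Token("TEXT", source[pos:match.start()])
        body = match.group(1)
        if body == "/":
            yield Token("CLOSE", "")
        elif body[0] == "#":
            yield Token("OPEN", body[1:])
        elif body[0] == "^":
            yield Token("INVERTED", body[1:])
        else:
            yield Token("VAR", body)
        pos = match.end()
    # Trailing text after the last tag, if any.
    if pos < len(source):
        yield Token("TEXT", source[pos:])


tokens = list(tokenize("<h1>{{title}}</h1>{{#myName}}<p>{{myName}}</p>{{/}}"))
for token in tokens:
    print(token)
```

With a flat stream like this, a parser can push a node on each `OPEN` or `INVERTED` token and pop on `CLOSE`, which handles nesting without asking the regular expression to do so.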