This is more of a lexer/parser design question:

Imagine having to lex/parse legacy templates (for static code analysis purposes) which are similar to html but with two additional features:

  • custom tags which have a certain prefix. '<xy...' for example
  • regions that contain a language expression delimited by a specific character. '#' for example. Regions execute code and render text or nothing as a result.

These kinds of templates are processed by an appserver to produce webpages. Now, if I take the XMLLexer / XMLParser examples from the website as a base grammar, consider the following rules:

  • custom tags are allowed everywhere in the html "sea". See below for some creative examples.
  • expression regions are also allowed everywhere in the html "sea"; inside custom tags, however, they appear only as attribute values. See below for examples.

With these two basic rules in mind and using the XMLLexer as a base, I would add two additional lexer modes, one for each rule, and configure them so that I can jump from mode to mode as needed.

Now here is the question: how would I define the "jumping" part?

The "everywhere" part seems hard to do, because I have to consider every parser rule and include the possibility that an expression region or a custom tag starts there. I can't get my head around this. Is it possible to have a kind of 'global' rule that triggers no matter where you are in the lexing/parsing process? Are lexer modes even the right tool for this use case?

Code Examples:

Note: not all examples are useful or best practice but they are certainly possible and present in legacy code:

Examples of expression regions:

<#test# asdf="fff">ggg</#test#>
<te#st# asdf="fff">ggg</te#st#>
<t#es#t asdf="fff">ggg</t#es#t>
<#te#st asdf="fff">ggg</#te#st>
<test #asdf#="fff">ggg</test>
<test asdf="#fff#">ggg</test>
<test asdf="fff">#ggg#</test>

Use of custom tags:

<xyif condition="#yesorno#">
    Si
<xyelse>
    No
</xyif>

<div conditionalvalue="
    <xyif condition="#yesorno#">
        Si
    <xyelse>
        No
    </xyif>
"/>

<div
    <xyif condition="#conditionalattribute#">
        attributable="attributable"
    </xyif>
/>

<xyif condition="#showelement#">
    <div class="conditionalelement">I am conditional</div>
</xyif>

<xyif condition="#outerbox#">
    <div class="conditional_outerbox">
</xyif>
    <div class="innercontent">
        ...
    </div>
<xyif condition="#outerbox#">
    </div>
</xyif>

1 Answer

You have the arbitrary macro-expansion problem. In essence, you have raw text ("sea of html") containing macros to be expanded into more "sea of text".

The problem is that the macros can occur in arbitrary places, violating any "structure" your sea might have; in fact, the invoked macro may be the source of apparently missing structure. (From your examples, the only concession is that you do not show macros replacing < > or " ".) This is anathema to parsing, which is about extracting structure. People have been cracking their skulls on this for a long time. You are not in for an easy ride.

Your real problem is that wherever such macros can occur, you can't be sure what the macro will produce without expanding it (expanding it would turn your problem back into conventional parsing). So what your parsing engine has to produce, if it produces anything, is the set of possible interpretations of the macro that lead to a valid parse. If the macros can introduce arbitrary structure, the number of possible interpretations gets enormous fast and you can't represent the answer in a practical way. If the macros are constrained (in your case, can #test# generate a > character?), you might be able to represent the result with multiple parse trees, but conventional parser generators cannot handle this. You could consider GLR parsers for this instead.