I have a grammar that needs to process comments starting with '{*' and ending with '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end with a '}', and pass everything else through as text.

The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient, as the parser has to build a node for every character it receives, and then I have to walk the nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as one large string.

On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.

The grammar is basically:

Parser:
text: ANY_CHAR+;

Lexer:

COMMENT: '{*' .*? '*}' -> skip;

... Token Definitions .....

ANY_CHAR: [ -~];

If I try to accumulate the text in the lexer, it swallows everything and doesn't recognize the comments or tokens, because a rule like ANY_CHAR+ matches everything and returns the comments and template markers as part of the string.

Does anybody know a way around this problem? At the moment it looks like I have to hand-write a lexer.

1 Answer

Yes, that is inefficient, but it is also not the way to do it. The solution lies entirely in the lexer.

As I understand it, you want to detect comments, template markers and text. For this, you should use lexer modes. Every time you hit "{", go into some lexer mode, say MODE1, where you detect only "*" or "$" or something else (I didn't quite understand what you meant by '{' followed by a '$' or an identifier), and depending on what you hit, go into MODE2 or MODE3. After that, in MODE2 or MODE3, wait for '}' and switch back to the default mode. Of course, you could add even more modes in between, depending on what you want to do, but for what I've just described:

  • MODE1 is where you determine whether you are now detecting a comment or a template marker. There are only two tokens in this mode: '*' and everything else. If it's '*', go to MODE2; if it's anything else, go to MODE3.
  • MODE2: there is only one token here that you need, and that is COMMENT, but you also need to detect '*}' or '}' (depending on how you want to handle it).
  • MODE3: similar to MODE2. Detect what you need and have a token that switches back to the default mode.
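
The modes above could be sketched roughly as follows. This is only a minimal sketch, not a tested grammar: the grammar name, the mode names (COMMENT_MODE, MARKER_MODE) and all token names are made up for illustration, and the identifier pattern is an assumption, since the question doesn't define one. It also folds MODE1 into the default mode, relying on the fact that the ANTLR 4 lexer prefers the longest match, so '{*' and '{$' win over a lone '{'. Note that modes are only allowed in a separate lexer grammar, not in a combined one.

```antlr
// Hypothetical names throughout; adjust to your real token definitions.
lexer grammar TemplateLexer;

// Default mode: plain text comes back in large chunks, not per character.
COMMENT_OPEN : '{*' -> skip, pushMode(COMMENT_MODE) ;
VAR_OPEN     : '{$' -> pushMode(MARKER_MODE) ;
TAG_OPEN     : '{' [a-zA-Z_] [a-zA-Z_0-9]* -> pushMode(MARKER_MODE) ;
LBRACE       : '{' -> type(TEXT) ;   // a '{' that starts nothing is just text
TEXT         : ~[{]+ ;               // one token per run of ordinary text

// MODE2 in the description above: discard everything up to '*}'.
mode COMMENT_MODE;
COMMENT_CLOSE : '*}' -> skip, popMode ;
COMMENT_BODY  : .    -> skip ;

// MODE3 in the description above: tokens inside a template marker.
mode MARKER_MODE;
MARKER_CLOSE : '}' -> popMode ;
ID           : [a-zA-Z_] [a-zA-Z_0-9]* ;   // assumed identifier shape
WS           : [ \t\r\n]+ -> skip ;
```

With a lexer along these lines, the parser rule from the question would become something like text: TEXT+; in a parser grammar that sets options { tokenVocab=TemplateLexer; }, so a file containing only text yields a handful of TEXT tokens instead of one token per character.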