I have a grammar that needs to process comments starting with '{*' and ending with '*}' at any point in the input stream. It also needs to process template markers, which start with '{' followed by a '$' or an identifier and end with '}', and it must pass everything else through as text.
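For example, given a made-up input like

Hello {* a comment *} world {$name} and {title} goodbye

the comment should be skipped, {$name} and {title} should become tokens, and the rest should pass through as text.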
The only way to achieve this seems to be to pass anything that isn't a comment or a token back to the parser as individual characters and let the parser build the string. This is incredibly inefficient: the parser has to build a node for every character it receives, and I then have to walk those nodes and build a string from them. It would be a lot simpler and faster if the lexer could just return the text as one large string.
On an i7, running the program as a 32-bit C# program on a 90K text file with no tokens or comments, just text, it takes about 15 minutes before it crashes with an out-of-memory exception.
The grammar is basically:
Parser:
text: ANY_CHAR+;
Lexer:
COMMENT: '{*' .*? '*}' -> skip;
... Token Definitions .....
ANY_CHAR: [ -~];
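With this grammar, every character of plain text comes back as its own ANY_CHAR token, so the 90K file above turns into roughly 90,000 tokens, each of which becomes a parse-tree node.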
If I try to accumulate the text in the lexer instead, it swallows everything and doesn't recognize the comments or tokens, because something like ANY_CHAR+ matches everything and returns the comments and template markers as part of the string.
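Concretely, I mean a rule along these lines (TEXT is just an illustrative name):

TEXT: [ -~]+; // greedy run of printable characters; also matches '{*' and '{$'

Because the lexer always prefers the longest match, TEXT out-competes COMMENT and the other token rules whenever their delimiters sit inside a longer run of printable characters.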
Does anybody know a way around this problem? At the moment it looks like I will have to hand-write a lexer.