Lexer DFA results in "code too large" error
I'm trying to parse Java Server Pages using ANTLR 3.
Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.
In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.
However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.
What should I do about it?
Are there common mistakes that result in complex DFAs?
Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?
Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.
Background
ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition()
. This method contains a switch
statement with a case for each state in the DFA. Within each case, there is a series of if
statements, one for each transition from the state. The condition of each if
statement tests an input character to see if it matches the transition.
These character-testing conditions can be very complex. They normally have the following form:
int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
…
case 13 :
if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
s = 24; /* If the character matches, move to the next state. */
else if …
A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)
I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.
specialStateTransition
. – Gunther