How can I simplify token prediction DFA?

Question

Lexer DFA results in "code too large" error

I'm trying to parse Java Server Pages using ANTLR 3.

Java has a limit of 64k for the byte code of a single method, and I keep running into a "code too large" error when compiling the Java source generated by ANTLR.

In some cases, I've been able to fix it by compromising my lexer. For example, JSP uses the XML "Name" token, which can include a wide variety of characters. I decided to accept only ASCII characters in my "Name" token, which drastically simplified some tests in the and lexer allowed it to compile.

However, I've gotten to the point where I can't cut any more corners, but the DFA is still too complex.

What should I do about it?

Are there common mistakes that result in complex DFAs?

Is there a way to inhibit generation of the DFA, perhaps relying on semantic predicates or fixed lookahead to help with the prediction?

Writing this lexer by hand will be easy, but before I give up on ANTLR, I want to make sure I'm not overlooking something obvious.

Background

ANTLR 3 lexers use a DFA to decide how to tokenize input. In the generated DFA, there is a method called specialStateTransition(). This method contains a switch statement with a case for each state in the DFA. Within each case, there is a series of if statements, one for each transition from the state. The condition of each if statement tests an input character to see if it matches the transition.

These character-testing conditions can be very complex. They normally have the following form:

int ch = … ; /* "ch" is the next character in the input stream. */
switch(s) { /* "s" is the current state. */
  …
  case 13 :
    if ((('a' <= ch) && (ch <= 'z')) || (('A' <= ch) && (ch <= 'Z')) || … )
      s = 24; /* If the character matches, move to the next state. */
    else if …

A seemingly minor change to my lexer can result in dozens of comparisons for a single transition, several transitions for each state, and scores of states. I think that some of the states being considered are impossible to reach due to my semantic predicates, but it seems like semantic predicates are ignored by the DFA. (I could be misreading things though—this code is definitely not what I'd be able to write by hand!)

I found an ANTLR 2 grammar in the Jsp2x tool, but I'm not satisfied with its parse tree, and I want to refresh my ANTLR skills, so I thought I'd try writing my own. I am using ANTLRWorks, and I tried to generate graphs for the DFA, but there appear to be bugs in ANTLRWorks that prevent it.

See this question for a case of "code too large", where a grammar change completely removes specialStateTransition. — Gunther
@Gunther, ah, yeah, the predicate inside that grammar caused that! I completely forgot that Q&A! I'll make a favorite of it. — Bart Kiers

Bart Kiers Bart Kiers · Accepted Answer · 2011-09-22T17:44:00

Grammars that are very large (many different tokens) have that problem, unfortunately (SQL grammars suffer from this too).

Sometimes this can be fixed by making certain lexer rules fragments opposed to "full" lexer rules that produce tokens and/or re-arranging the way characters are matched inside the rules, but by looking at the way you already tried yourself, I doubt there can gained much in your case. However, if you're willing to post your lexer grammar here on SO, I, or someone else, might see something that could be changed.

In general, this problem is fixed by splitting the lexer grammar into 2 or more separate lexer grammars and then importing those in one "master" grammar. In ANTLR terms, these are called composite grammars. See this ANTLR Wiki page about them: http://www.antlr.org/wiki/display/ANTLR3/Composite+Grammars

EDIT

As @Gunther rightfully mentioned in the comment beneath the OP, see the Q&A: Why my antlr lexer java class is "code too large"? where a small change (the removal of a certain predicate) caused this "code too large"-error to disappear.