Antlr4 regular expression grammar, data structure for NFA transition table

Question

I apologize for the extremely long explanation but I'm stuck for a month now and I really can't figure out how to solve this. I have to derive, as a project, a compiler with antlr4 for a regex grammar that generate a program (JAVA) able to distinguish words belonging to the language generated by a regex used as input for antlr4 compiler. The grammar that we have to use is this one:

RE ::= union | simpleRE
union ::= simpleRE + RE
simpleRE ::= concatenation | basicRE
concatenation ::= basicRE simpleRE
basicRE ::= group | any | char
group ::= (RE) | (RE)∗ | (RE)+
any ::= ?
char ::= a | b | c | ··· | z | A | B | C | ··· | Z | 0 | 1 | 2 | ··· | 9 | . | − | _

and from that, I gave this grammar to antrl4

Regexp.g4

grammar Regxp;

start_rule              
    : re                            # start
    ;

re
    :    union                      
    | simpleRE                      
    ;

union 
    :    simpleRE '+' re            # unionOfREs
    ;

simpleRE
    :    concatenation                      
    | basicRE                               
    ;

concatenation
    :    basicRE simpleRE                   #concatOfREs
    ;

basicRE
    :    group                      
    | any                               
    | cHAR                              
    ;


group
    :  LPAREN re RPAREN '*'             # star
    |  LPAREN re RPAREN '+'             # plus
    |  LPAREN re RPAREN                 # singleWithParenthesis
    ;


any
    :   '?'                             
    ;


cHAR
    : CHAR              #singleChar
    ;

WS : [ \t\r\n]+ -> skip ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : LETTER | DIGIT | DOT | D | UNDERSCORE
    ;
/* tokens */
fragment LETTER:    [a-zA-Z]
    ;
fragment DIGIT: [0-9]
    ;
fragment DOT:  '.'
    ;
fragment D:  '-'
    ;
fragment UNDERSCORE: '_'
    ;

Then i generated the java files from antlr4 with visitors. As far as i understood the logic of the project, when the visitor is traversing the parse tree, it has to generate lines of code to fill the transition table of the NFA derived as applying the Thompson rules on the input regexp. Then these lines of code are to be saved as a .java text file, and compiled to a program that takes in input a string (word) and tells if the word belongs or not to the language generated by the regex. The result should be like this:

RE      word    Result
a+b       a       OK
          b       OK
         ac       KO

a∗b     aab       OK
         b        OK
       aaaab      OK
        abb       KO

So I'm asking, how can I represent the transition table in a way such that it can be filled during the visit of the parse tree and then exported in order to be used by a simple java program implementing the acceptance algorithm for an NFA? (i'm considering this pseudo-code):

S = ε−closure(s0);
c = nextChar();
while (c ≠ eof) do
S = ε−closure(move(S,c));
c = nextChar();
end while
if (S ∩ F ≠ ∅) then return “yes”;
else return “no”;
end if

As of now I managed to make that, when the visitor is for example in the unionOfREs rule, it will do something like this:

MyVisitor.java

private List<String> generatedCode = new ArrayList<String>();

/* ... */
@Override 
public String visitUnionOfREs(RegxpParser.UnionOfREsContext ctx) { 
    System.out.println("unionOfRExps");
    String char1 = visit(ctx.simpleRE());
    String char2 = visit(ctx.re());
    generatedCode.add("tTable.addUnion("+char1+","+char2+");");
    //then this line of code will populate the transition table
    return char1+"+"+char2;
}
/* ... */

The addUnion it's inside a java file that will contains all the methods to fill the transition table. I wrote code for the union, but i dont' like it because it's like to write the transition table of the NFA, as you would write it on a paper: example. I got this when I noticed that by building the table iteratively, you can define 2 "pointers" on the table, currentBeginning and currentEnd, that tell you where to expand again the character written on the table, with the next rule that the visitor will find on the parse tree. Because this character can be another production or just a single character. On the link it is represented the written-on-paper example that convinced me to use this approach.

TransitionTable.java

/* ... */
public void addUnion(String char1, String char2) {
    if (transitionTable.isEmpty()) {
    List<List<Integer>> lc1 = Arrays.asList(Arrays.asList(null)
            ,Arrays.asList(currentBeginning+3)
            ,Arrays.asList(null)
            ,Arrays.asList(null)
            ,Arrays.asList(null)
            ,Arrays.asList(null));
    List<List<Integer>> lc2 = Arrays.asList(Arrays.asList(null)
            ,Arrays.asList(null)
            ,Arrays.asList(currentBeginning+4)
            ,Arrays.asList(null)
            ,Arrays.asList(null)
            ,Arrays.asList(null));
    List<List<Integer>> le = Arrays.asList(Arrays.asList(currentBeginning+1,currentBeginning+2)
            ,Arrays.asList(null)
            ,Arrays.asList(null)
            ,Arrays.asList(currentBeginning+5)
            ,Arrays.asList(currentBeginning+5)
            ,Arrays.asList(null));

        transitionTable.put(char1, lc1);
        transitionTable.put(char2, lc2);
        transitionTable.put("epsilon", le);
        //currentBeginning += 2;
        //currentEnd = transitionTable.get(char2).get(currentBeginning).get(0);
        currentEnd = transitionTable.get("epsilon").size()-1;//il 5
        } else { //not the first time it encounters this rule, beginning and end changed
            //needs to add 2 less states
        }
    }
/* ... */

At the moment I'm representing the transition table as HashMap<String, List<List<Integer>>> strings are for chars on the edges of the NFA and List<List<Integer>> because by being non deterministic, it needs to represent more transitions from a single state. But going this way, for a parse tree like this i will obtain this line of code for the union : "tTable.addUnion("tTable.addConcat(a,b)","+char2+");"

And i'm blocked here, i don't know how to solve this and i really can't think a different way to represent the transition table or to fill it while visiting the parse tree.

Thank You.

Matt Timmermans Matt Timmermans · Accepted Answer · 2017-08-02T14:37:40

Using Thompson's construction, every regular (sub-)expression produces an NFA, and every regular expression operator (union, cat, *) can be implemented by adding a couple states and connecting them to states that already exists. See:

https://en.wikipedia.org/wiki/Thompson%27s_construction

So, when parsing the regex, every terminal or non-terminal production should add the required states and transitions to the NFA, and return its start and end state to the containing production. Non-terminal productions will combine their children and return their own start+end states so that your NFA can be built from the leaves of the regular expression up.

The representation of the state table is not critical for building. Thompson's construction will never require you to modify a state or transition that you built before, so you just need to be able to add new ones. You will also never need more than one transition from a state on the same character, or even more than one non-epsilon transition. In fact, if all your operators are binary you will never need more than 2 transitions on a state. Usually the representation is designed to make it easy to do the next steps, like DFA generation or direct execution of the NFA against strings.

For example, a class like this can completely represent a state:

class State
{
    public char matchChar;
    public State matchState; //where to go if you match matchChar, or null
    public State epsilon1; //or null
    public State epsilon2; //or null
}

This would actually be a pretty reasonable representation for directly executing an NFA. But if you already have code for directly executing an NFA, then you should probably just build whatever it uses so you don't have to do another transformation.

Antlr4 regular expression grammar, data structure for NFA transition table

1 Answers