1
votes

I am new to Antlr and parsing, so this is a learning exercise for me.

I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.

In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:

JOB=(JOB)

I have tried the following grammar, which avoids defining the language's keywords in lexer rules.

grammar Test;

test1   :   'JOB' EQ OPAREN (utext) CPAREN ;
utext   :    UNQUOTEDTEXT ;

COMMA           :   ',' ;
OPAREN          :   '(' ;
CPAREN          :   ')' ;
EQ              :   '=' ;
UNQUOTEDTEXT    :   ~[a-z,()\'\" \r\n\t]*? ;
SPC             :   [ \t]+      -> skip  ;

I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location where they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), using the example phrase above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:

line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT

So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:

test2   :   'DATA' EQ OPAREN (utext) CPAREN ;

and tested with "DATA=(JOB)"

I got the following error (similar to before):

line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT

Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?

Thanks!

2
As a side note, never create lexer rules that potentially match zero chars (like your UNQUOTEDTEXT rule). This might cause the lexer to produce an infinite amount of empty-string tokens. - Bart Kiers
Will do. Thank you - I appreciate your guidance. I'm starting to read your blog on creating TL just now. There's a lot to learn! - v0rl0n

2 Answers

2
votes

What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text, with lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "re-enter structured text area". In your case it seems to be (...). The lexer tokenizes the input without knowing which parser rule is active, which is why 'JOB' comes out as the keyword token everywhere. ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your key word here.
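
As a rough sketch of how that might look for the grammar in the question (an untested assumption, not a definitive solution): lexical modes are only allowed in a lexer grammar, so the combined grammar is split into a lexer and a parser. The names TestLexer, TestParser, FREETEXT and FT_SPC are made up for illustration. The UNQUOTEDTEXT character set is copied from the question, with + in place of *? so the rule can no longer match an empty string. It also assumes that every '(' opens a free-text area, which may not hold for the full language.

```antlr
// TestLexer.g4 (hypothetical file name)
lexer grammar TestLexer;

// Default mode: the structured part of the language.
JOB     : 'JOB' ;
DATA    : 'DATA' ;
COMMA   : ',' ;
EQ      : '=' ;
OPAREN  : '(' -> pushMode(FREETEXT) ;   // sentinel: enter the free-text area
SPC     : [ \t]+ -> skip ;

// Free-text mode: no keyword rules exist here, so 'JOB' cannot become a JOB token.
mode FREETEXT;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]+ ;  // set taken from the question, '+' instead of '*?'
CPAREN       : ')' -> popMode ;         // sentinel: back to the default mode
FT_SPC       : [ \t]+ -> skip ;

// TestParser.g4 (hypothetical file name)
parser grammar TestParser;
options { tokenVocab = TestLexer; }

test1 : JOB  EQ OPAREN utext CPAREN ;
test2 : DATA EQ OPAREN utext CPAREN ;
utext : UNQUOTEDTEXT ;
```

With this split, the first JOB in "JOB=(JOB)" should be lexed in the default mode as the JOB token, while the second one is seen only after the '(' has pushed the FREETEXT mode, where UNQUOTEDTEXT is the only rule that can match it. Characters that the set excludes (lowercase letters, commas, quotes) would still be rejected inside the parentheses, just as in the original rule.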

0
votes

I had a similar problem with keywords that are sometimes only identifiers. I did it this way:

 OnlySometimesAKeyword : 'value' ;

 identifier 
     :   Identifier // defined as usual
     |   maybeKeywords
     ;

 maybeKeywords
     :   OnlySometimesAKeyword 
     // ...
     ;

In your parser rules, simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also accept them as identifiers in places where they are meant to be keywords, but you could check for that in the parser if necessary.
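
Applied to the grammar from the question, the same idea might look like the sketch below. This is an assumption about how the pieces fit together, not a tested grammar. The explicit JOB and DATA lexer rules and the maybeKeywords alternative in utext are additions for illustration. The UNQUOTEDTEXT set is the one from the question, with two changes that are also assumptions: + replaces *? so the rule cannot match an empty string, and '=' is added to the excluded characters, since otherwise "JOB=" would be matched as one longer UNQUOTEDTEXT token.

```antlr
grammar Test;

test1 : JOB  EQ OPAREN utext CPAREN ;
test2 : DATA EQ OPAREN utext CPAREN ;

// Free text is either a plain text token or a keyword token used as text.
utext         : UNQUOTEDTEXT | maybeKeywords ;
maybeKeywords : JOB | DATA ;

// Keyword rules come first so they win a same-length tie against UNQUOTEDTEXT.
JOB          : 'JOB' ;
DATA         : 'DATA' ;
COMMA        : ',' ;
OPAREN       : '(' ;
CPAREN       : ')' ;
EQ           : '=' ;
UNQUOTEDTEXT : ~[a-z,()=\'\" \r\n\t]+ ;  // '=' excluded here (assumption, see above)
SPC          : [ \t]+ -> skip ;
```

The lexer still produces a JOB token for every standalone "JOB", but utext now accepts that token, so "JOB=(JOB)" and "DATA=(JOB)" should both parse. Longer words such as "JOBS" are matched by UNQUOTEDTEXT because the lexer prefers the longest match.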