I'm converting an ANTLR 3 project, whose lexer reads a C++ header file, to ANTLR 4. One specific lexer rule fails to convert to the ANTLR 4 grammar.

The original line from the Antlr3 grammar is:

DEFINE                  :   '#define' ~(' WM_NEWUSER (WM_USER + 2)');

Using this in the ANTLR 4 lexer results in the following error:

multi-character literals are not allowed in lexer sets: 'WM_NEWUSER'

I tried wrapping the multi-character literals as described in How to fix the "multi-character literals are not allowed" error in antlr4 lexer rule? but it didn't work, or I did something wrong.

Edit: Original Grammar

grammar Grammar;

@lexer::header {package main;}

WS                      :   (' '|'\t')*;
NEW_LINE                :   '\r'? '\n';
IGNORE_LINES            :   '//@MySQL:IGNORE LINES:' ('0'..'9')+ (' '|'\t')* '\r'? '\n';
IGNORE_LINES2           :   '/*@MySQL:IGNORE LINES:' ('0'..'9')+ (' '|'\t'|'\r'|'\n')* '*/';
IGNORE_KEY              :   '//@MySQL:IGNORE KEY' (' '|'\t')* '\r'? '\n';
IGNORE_KEY2             :   '/*@MySQL:IGNORE KEY' (' '|'\t'|'\r'|'\n')* '*/';
IGNORE_STRUCT           :   '//@MySQL:IGNORE STRUCT' (' '|'\t')* '\r'? '\n';
IGNORE_STRUCT2          :   '/*@MySQL:IGNORE STRUCT' (' '|'\t'|'\r'|'\n')* '*/';
PRIMARY                 :   '//@MySQL:PRIMARY' (' '|'\t')* '\r'? '\n';
PRIMARY2                :   '/*@MySQL:PRIMARY' (' '|'\t'|'\r'|'\n')* '*/';
OPKZ_ZUORD              :   '//@MySQL:OPKZ:' (VARNAME ';')+ (' '|'\t')* '\r'? '\n';
OPKZ_ZUORD2             :   '/*@MySQL:OPKZ:' (VARNAME ';')+ (' '|'\t'|'\r'|'\n')* '*/';
DESC_KEY                :   '//@MySQL:DESC' (' '|'\t')* '\r'? '\n';
DESC_KEY2               :   '/*@MySQL:DESC' (' '|'\t'|'\r'|'\n')* '*/';
MYSQL_KEY_INFO_MULTI    :   '//@MySQL:MULTIKEY:' .* '\n';
MYSQL_KEY_INFO_MULTI2   :   '/*@MySQL:MULTIKEY:' .* '*/';
MYSQL_KEY_INFO_SPECIAL  :   '//@MySQL:SPECIALKEY:' .* '\n';
MYSQL_KEY_INFO_SPECIAL2 :   '/*@MySQL:SPECIALKEY:' .* '*/';
MYSQL_KEY_INSENSITIVE   :   '//@MySQL:INSENSITIVEKEY:' .* '\n';
COMMENT                 :   '/*' .* '*/';
LINE_COMMENT            :   '//' ~('\n'|'\r')* '\r'? '\n';
SIGNED_UNSIGNED         :   'signed'|'unsigned';
TYPE                    :   ('char'|'short'|'int'|'long'|'__int8'|'__int16'|'__int32'|'__int64'|'bool'|'float'|'double');
RESERVE                 :   'reserve'|'szReserve';
TYPEDEF                 :   'typedef';
ENUM                    :   'enum';
STRUCT                  :   'struct';
UNION                   :   'union';
CONST                   :   'const';
DEFINE                  :   '#define' ~(' WM_NEWUSER (WM_USER + 2)');
WM_USER_PLUS_2          :   '#define WM_NEWUSER (WM_USER + 2)' (' '|'\t')* '\r'? '\n';
BRACKET_OPEN            :   '(';
BRACKET_CLOSE           :   ')';
CURLY_BRACE_OPEN        :   '{';
CURLY_BRACE_CLOSE       :   '}';
SQUARE_BRACKET_OPEN     :   '[';
SQUARE_BRACKET_CLOSE    :   ']';
SEMI                    :   ';';
PLUS                    :   '+';
MINUS                   :   '-';
EQUALS                  :   '=';
MAL                     :   '*';
BACKSLASH               :   '\\';
KOMMA                   :   ',';
NUMBER                  :   (('0'..'9')+)|(('0x') (('0'..'9')|('a'..'f')|('A'..'F'))+)|('\'' '\\'? ('a'..'z'|'A'..'Z') '\'');
VARNAME                 :       ('a'..'z' | 'A'..'Z' | '_' ) ('0'..'9' | 'a'..'z' | 'A'..'Z' | '_' )*;
VERODERT                :   '(' (' '|'\t')* VARNAME (' '|'\t')* ('|' (' '|'\t')* VARNAME (' '|'\t')* )* ')';
PRAGMA_ONCE             :   '#pragma' (' '|'\t')+ 'once';
IF_NOT_DEFINED1         :   '#if' (' '|'\t')+ '!' (' '|'\t')* 'defined' (' '|'\t')* VARNAME (' '|'\t')* '\r'? '\n';
IF_DEFINED1             :   '#if' (' '|'\t')+ 'defined' (' '|'\t')* VARNAME (' '|'\t')* '\r'? '\n';
IF_NOT_DEFINED2         :   '#if' (' '|'\t')+ '!' 'defined' (' '|'\t')* '(' (' '|'\t')* VARNAME (' '|'\t')* ')' '\r'? '\n';
IF_DEFINED2             :   '#if' (' '|'\t')+ 'defined' (' '|'\t')* '(' (' '|'\t')* VARNAME (' '|'\t')* ')' '\r'? '\n';
ENDIF                   :   '#endif';

Any hints on how to resolve this?

Bart Kiers: Are you sure that is correct? The negation of ' WM_NEWUSER (WM_USER + 2)' seems odd.
Falk: It's not my code, I'm just the one who has to maintain it, but yes, it works. It matches exactly every #define in the header except #define WM_NEWUSER (WM_USER + 2).
Bart Kiers: It's been a while since I used ANTLR 3, but I don't think it's a valid ANTLR 3 lexer rule. Isn't it a parser rule in ANTLR 3? Can you post the v3 grammar?
Falk: I've added the original lexer grammar.

1 Answer

The negation of ' WM_NEWUSER (WM_USER + 2)' has more or less undefined behaviour in ANTLR 3.

In lexer rules, ~ negates a character set and always matches a single character. It cannot negate the entire string ' WM_NEWUSER (WM_USER + 2)'.

Test it yourself with the input: #define foobar. There will be 2 tokens:

  1. DEFINE token, with text #define_ (note that the _ is a space after define!)
  2. VARNAME token, with text foobar

And if you tokenise #definefoobar, you also get 2 tokens:

  1. DEFINE token, with text #definef
  2. VARNAME token, with text oobar

As you can see, the negated part after '#define' will always match a single character.

Since what is being negated isn't a proper character set, you might as well have written the rule like this:

DEFINE : '#define' .;

Yes, that will behave the same as:

DEFINE : '#define' ~(' WM_NEWUSER (WM_USER + 2)');
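For the ANTLR 4 port, the apparent intent (treat the one special #define line differently from all others) can be expressed without any negation, because the ANTLR 4 lexer picks the rule with the longest match and, on a tie, the rule listed first. The following is only a sketch under assumptions: the grammar name DefineSketch is made up, and the trailing . in DEFINE is there solely to mirror what the v3 rule effectively matched. Dropping it would be cleaner, but the parser rules (not shown in the question) may depend on the current token text.

```antlr
// Hypothetical stand-alone lexer grammar, for illustration only.
lexer grammar DefineSketch;

// The special line, matched as one token. Its match is longer than
// DEFINE's, so the lexer prefers it whenever the whole line is present.
WM_USER_PLUS_2 : '#define WM_NEWUSER (WM_USER + 2)' [ \t]* '\r'? '\n' ;

// Every other #define: the keyword plus one arbitrary character,
// which is what the v3 rule matched in practice.
DEFINE         : '#define' . ;

BRACKET_OPEN   : '(' ;
BRACKET_CLOSE  : ')' ;
PLUS           : '+' ;
NUMBER         : [0-9]+ ;
VARNAME        : [a-zA-Z_] [a-zA-Z0-9_]* ;
WS             : [ \t]+ ;
NEW_LINE       : '\r'? '\n' ;
```

With this layout, an input such as #define WM_NEWUSER (WM_USER + 23) should no longer be an error: when WM_USER_PLUS_2 fails at the 3, the ANTLR 4 lexer falls back to the last rule that did match (DEFINE) and continues from there.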

A couple of other observations:

  • ANTLR3's lexer is bad at backtracking if it cannot create a token at a "late stage". Try tokenising #define WM_NEWUSER (WM_USER + 23). When it stumbles upon the 3, it will have to give up on the rule WM_USER_PLUS_2, but it cannot find another lexer rule for the characters that it already consumed, and will produce an error.
  • I see a lot of .* '\n' in your lexer: this is a bad habit, try to avoid .* whenever you can. Use ~('\n')* '\n' instead.
  • the WS rule matches an empty string, which is a no-no for lexer rules (an empty match consumes no input, so the lexer can produce empty tokens at the same position over and over and grind to a halt at runtime)
  • your LINE_COMMENT forces there to be a line break at the end. This will fail when the end of your input has a line comment (without a line break at the end)
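If the grammar is ported rather than rewritten, the last three points could look roughly like this in ANTLR 4 syntax. Treat it as a sketch, not a drop-in replacement: the grammar name CommentSketch is invented, only a few of the original rules are shown, and whether the trailing line break should stay inside the comment tokens depends on parser rules that are not part of the question. One conversion detail worth knowing: in ANTLR 3, .* followed by a terminator was non-greedy by default, whereas in ANTLR 4 it is greedy, so it has to be written as .*? to stop at the first '*/'.

```antlr
// Hypothetical stand-alone lexer grammar, for illustration only.
lexer grammar CommentSketch;

// Listed before LINE_COMMENT and COMMENT: when two rules match text of
// the same length, the rule defined first wins.
MYSQL_KEY_INFO_MULTI  : '//@MySQL:MULTIKEY:' ~[\r\n]* ('\r'? '\n')? ;
MYSQL_KEY_INFO_MULTI2 : '/*@MySQL:MULTIKEY:' .*? '*/' ;

// Non-greedy .*? stops at the first '*/' instead of the last one.
COMMENT               : '/*' .*? '*/' ;

// The line break is optional, so a comment on the last line of the
// input (with no trailing newline) still produces a token.
LINE_COMMENT          : '//' ~[\r\n]* ('\r'? '\n')? ;

// + instead of *: the rule can no longer match an empty string.
WS                    : [ \t]+ ;
NEW_LINE              : '\r'? '\n' ;
```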

My suggestion: throw that v3 grammar away and either start from scratch or try to find an open-source grammar that suits your needs.