
I'm new to writing grammars. I've read about half of The Definitive ANTLR 4 Reference and thought I'd take a swing at the grammar I'm working on. I'm stuck on something that sounds basic but is proving more difficult than I expected.

I'm trying to parse smali (android disassembly) class declarations. It looks like this:

.class public Lcom/packageName/example;

The rule in smali is that every fully qualified class name is prefixed with an L, the package parts are separated by a /, and the class name is the last part before the ;.

So I have a general WS : [ \t\r\n]+ -> skip ; rule in the lexer part of my grammar, but I'm struggling with how to disable it when parsing a fullyQualifiedClass. I want to disable it so that the first classPackageComponent is returned without the L, but if the L is forgotten, an error such as "'L' expected" is shown. The same goes for package names: there can't be spaces between package names and their / separators. I know the issue is that my parser never even sees the WS characters, because the lexer just throws them out. How should I be approaching the problem? Are channels the answer? I haven't gotten to that chapter yet, but other SO posts suggest they might be.

My incorrect grammar code is below:

grammar smali;

smaliClass : classDeclaration;

classDeclaration : '.class' accessModifier fullyQualifiedClass;
accessModifier: 'public' | 'private';
fullyQualifiedClass: 'L' ~WS classPackage? className;

classPackage: (classPackageComponent ~WS '/')+;
classPackageComponent: Identifier;
className: Identifier;


Identifier
    :   Letter (Letter|JavaIDDigit)*
    ;

fragment
Letter
    :  '\u0024' |
       '\u0041'..'\u005a' |
       '\u005f' |
       '\u0061'..'\u007a' |
       '\u00c0'..'\u00d6' |
       '\u00d8'..'\u00f6' |
       '\u00f8'..'\u00ff' |
       '\u0100'..'\u1fff' |
       '\u3040'..'\u318f' |
       '\u3300'..'\u337f' |
       '\u3400'..'\u3d2d' |
       '\u4e00'..'\u9fff' |
       '\uf900'..'\ufaff'
    ;

fragment
JavaIDDigit
    :  '\u0030'..'\u0039' |
       '\u0660'..'\u0669' |
       '\u06f0'..'\u06f9' |
       '\u0966'..'\u096f' |
       '\u09e6'..'\u09ef' |
       '\u0a66'..'\u0a6f' |
       '\u0ae6'..'\u0aef' |
       '\u0b66'..'\u0b6f' |
       '\u0be7'..'\u0bef' |
       '\u0c66'..'\u0c6f' |
       '\u0ce6'..'\u0cef' |
       '\u0d66'..'\u0d6f' |
       '\u0e50'..'\u0e59' |
       '\u0ed0'..'\u0ed9' |
       '\u1040'..'\u1049'
   ;

WS : [ \t\r\n]+ -> skip ;

1 Answer


A few issues to deal with first:

  1. The Lexer and Parser are largely separate: the Parser cannot modify the operation of the Lexer
  2. The Parser should deal in tokens, not characters or character strings -- the Lexer cannot see what is going on in the Parser, so you lose the ability of the Lexer to make certain optimizations
  3. The Lexer does not insert whitespace tokens between tokens that it recognizes
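
To make the third point concrete, here is a small stand-alone sketch. It is a toy regex tokenizer, not ANTLR-generated code, and the class name SkipDemo and its simplified ASCII-only Identifier pattern are assumptions made for illustration. It shows that once whitespace is skipped, input with and without spaces around the / separators yields the same list of token types, so nothing downstream can tell the two apart:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SkipDemo {
    // Toy stand-in for the lexer rules; each named group plays the role of one rule.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WS>[ \\t\\r\\n]+)"
        + "|(?<Class>\\.class)"
        + "|(?<Modifier>(?:public|private)\\b)"
        + "|(?<Slash>/)"
        + "|(?<Semi>;)"
        + "|(?<Identifier>[A-Za-z_$][A-Za-z0-9_$]*)");

    private static final String[] NAMES =
        {"WS", "Class", "Modifier", "Slash", "Semi", "Identifier"};

    // Returns the token type names in order, dropping WS the way "-> skip" would.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
            for (String name : NAMES) {
                if (m.group(name) != null) {
                    if (!name.equals("WS")) {
                        types.add(name); // whitespace never reaches the token list
                    }
                    break;
                }
            }
            pos = m.end();
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes(".class public Lcom/packageName/example;"));
        System.out.println(tokenTypes(".class public Lcom / packageName / example ;"));
    }
}
```

Both calls in main produce the same list, which is why a parser rule such as ~WS has nothing to match against.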

Lexer

Class: '.class' ;
Semi: ';' ; 
Modifier: 'public' | 'private';
Slash: '/' ;
ClassPrefix : 'L' { isPrefix() }? ;
Identifier :   Letter (Letter|JavaIDDigit)* ;
WS : [ \t\r\n]+ -> skip ;
...

Lexing .class public Lcom/packageName/example; (if isPrefix() always returns false) will produce the token stream:

Class, Modifier, Identifier, Slash, Identifier, Slash, Identifier, Semi

That is what the Parser will see. So, the parser rules become

classDeclaration : Class Modifier fullyQualifiedClass ;
fullyQualifiedClass: ClassPrefix? (Identifier Slash)* Identifier Semi ;

The problem with the ClassPrefix is that there is no natural separator that you can use to separate it out. There are a number of ways to get around this, though.

Perhaps the most direct way is for the Lexer to check, every time it sees an 'L', whether it is at the beginning of something that looks like a class name. That is what the predicate { isPrefix() }? is intended to do. Look here and here for examples of predicate implementations.
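
As a rough idea of what such a check could look like, here is a minimal sketch of the look-ahead logic on its own. The class name PrefixCheck and the (input, pos) signature are hypothetical; inside a generated lexer the same scan would presumably read ahead on the character stream (for example via _input.LA(k)) rather than take a CharSequence. It also assumes Java-style identifier characters, which is looser than the grammar's Letter and JavaIDDigit fragments:

```java
public class PrefixCheck {
    // Hypothetical helper: true if the 'L' at index pos starts something shaped
    // like Lpkg/.../Name; with no whitespace anywhere before the ';'.
    static boolean isPrefix(CharSequence input, int pos) {
        if (pos >= input.length() || input.charAt(pos) != 'L') {
            return false;
        }
        int i = pos + 1;
        while (true) {
            // Each component must start with an identifier-start character.
            if (i >= input.length() || !Character.isJavaIdentifierStart(input.charAt(i))) {
                return false;
            }
            i++;
            // Consume the rest of the component.
            while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) {
                i++;
            }
            if (i >= input.length()) {
                return false; // ran out of input before the terminating ';'
            }
            char c = input.charAt(i);
            if (c == ';') {
                return true;  // reached the end of the class name
            }
            if (c != '/') {
                return false; // whitespace or any other character breaks the name
            }
            i++; // step over '/', then expect another component
        }
    }

    public static void main(String[] args) {
        System.out.println(isPrefix("Lcom/packageName/example;", 0));
        System.out.println(isPrefix("Lcom /packageName/example;", 0));
    }
}
```

Because the scan rejects any character other than an identifier part, / or ;, a name with embedded whitespace is not treated as a class name, which is the restriction the question asks for.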

Another way is to drop the ClassPrefix rule from the Lexer entirely, and detect the prefix in the parser rule action or, better yet, in a subsequent walk of the parse tree:

fullyQualifiedClass: (Identifier Slash)* Identifier Semi ;

The first instance of Identifier is a Token that contains the underlying text matched by the token. On each instance of the generated context class YourParser.FullyQualifiedClassContext, you can call Identifier() to get a List of the Identifier tokens encountered, in order from left to right. Check the leftmost one for the prefix and handle it accordingly.
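
A minimal sketch of that check follows. It works on plain strings so that it runs without the ANTLR runtime; in a listener or visitor the list would presumably be built by calling getText() on each entry returned by ctx.Identifier(). The class name ClassNameParts, the method stripPrefix and the error message wording are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class ClassNameParts {
    // Takes the Identifier texts in left-to-right order, verifies that the
    // leftmost one carries the 'L' prefix, and returns the parts without it.
    static List<String> stripPrefix(List<String> identifiers) {
        if (identifiers.isEmpty()) {
            throw new IllegalArgumentException("no identifiers in class name");
        }
        String first = identifiers.get(0);
        // The prefix must be followed by at least one more character.
        if (first.length() < 2 || first.charAt(0) != 'L') {
            throw new IllegalArgumentException("'L' expected before '" + first + "'");
        }
        List<String> parts = new ArrayList<>(identifiers);
        parts.set(0, first.substring(1)); // drop the prefix from the first component
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix(List.of("Lcom", "packageName", "example")));
    }
}
```

The last element of the returned list is the class name and everything before it is the package.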