I'm new to writing grammars. I've read about half of The Definitive ANTLR 4 Reference and thought I'd take a swing at the grammar I'm working on. I'm stuck on something that sounds basic but is proving harder than I expected.
I'm trying to parse smali (Android disassembly) class declarations. One looks like this:
.class public Lcom/packageName/example;
The rule in smali is that every fully-qualified class name is prefixed with an L, the package parts are separated by a /, and the class name is the last part before the ;.
So I have a general WS : [ \t\r\n]+ -> skip ; rule in the lexer part of my grammar, but I'm struggling with how to disable it while parsing a fullyQualifiedClass. I want to disable it so that the first classPackageComponent is returned without the L, but if the L is forgotten, an error such as "'L' expected" is shown. The same goes for package names: there can't be spaces between the package names and their / separators.
I know the issue is that my parser never even sees the WS characters, because the lexer throws them away. How should I be approaching this problem? Are channels the answer? I haven't gotten to that chapter yet, but other SO posts suggest they might be.
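To make the behavior I'm after concrete, here is a minimal sketch in plain Python rather than ANTLR. The helper name split_descriptor and the exact error messages are hypothetical, and this only illustrates the accepted and rejected inputs, not how the grammar itself should be written:

```python
from typing import List, Tuple


def split_descriptor(descriptor: str) -> Tuple[List[str], str]:
    """Split a class descriptor like 'Lcom/packageName/example;'
    into (package components, class name)."""
    # No whitespace is allowed anywhere between 'L' and ';'.
    if any(ch.isspace() for ch in descriptor):
        raise ValueError("whitespace is not allowed inside a class descriptor")
    # The leading 'L' is mandatory and is not part of the first component.
    if not descriptor.startswith("L"):
        raise ValueError("'L' expected")
    if not descriptor.endswith(";"):
        raise ValueError("';' expected")
    parts = descriptor[1:-1].split("/")
    if any(part == "" for part in parts):
        raise ValueError("empty name component")
    # Everything before the last '/' is the package; the rest is the class name.
    *package, class_name = parts
    return package, class_name


print(split_descriptor("Lcom/packageName/example;"))
# prints (['com', 'packageName'], 'example')
```

Inputs such as "com/packageName/example;" (missing L) or "Lcom /packageName/example;" (a space before the separator) should be rejected, and that is what I want the grammar to enforce.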
My incorrect grammar code is below:
grammar smali;
smaliClass : classDeclaration;
classDeclaration : '.class' accessModifier fullyQualifiedClass;
accessModifier: 'public' | 'private';
fullyQualifiedClass: 'L' ~WS classPackage? className;
classPackage: (classPackageComponent ~WS '/')+;
classPackageComponent: Identifier;
className: Identifier;
Identifier
: Letter (Letter|JavaIDDigit)*
;
fragment
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
fragment
JavaIDDigit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : [ \t\r\n]+ -> skip ;