I'm trying to create a language using ANTLR in which each line consists of one instruction: an opcode followed by any number of operands, like so:

aaa "str1" "str2" 123
bbb 123 "str" 456
ccc
ddd

I have strings seemingly working OK, but integers seem to be parsed incorrectly.

Here's my complete grammar file:

grammar Insn;

prog: (line? NEWLINE)+;

line: instruction;
instruction: instruction_name instruction_operands?;

instruction_name: IDENTIFIER;
instruction_operands: instruction_operand instruction_operand*;
instruction_operand: ' '+ (operand_int | operand_string);

operand_int: INT;
operand_string: QSTRING;

NEWLINE : [\r\n]+;
IDENTIFIER: [a-zA-Z0-9_\-]+;
INT: '-'?[0-9]+;
QSTRING: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
COMMENT: ';' ~[\r\n]* -> channel(HIDDEN);

I've tried several different INT definitions, such as INT: '-'?('0'..'9')+; and INT: '2'; (after changing every integer in the input to 2). Each attempt results in an error similar to line 1:18 extraneous input '123' expecting {' ', INT, QSTRING}, where the line number, column and integer vary with whatever was being parsed.

Here's the parse tree generated by ANTLR's tooling, as used in the ANTLR getting-started.md document: [image: parse tree]

I'm completely new to ANTLR and am not familiar with much of the terminology, so please keep it simple for me.

I'm not quite sure, but I think INT: '-'?[0-9]+; may need an extra blank: INT: '-'? [0-9]+; - Dietmar Höhmann
@DietmarHöhmann just tried it, nothing changed. - Kirby Gaming
Indeed! I was wrong. The problem is one line above: 123 is recognised as IDENTIFIER, because it is a valid identifier (all INTs are). The two rules must be distinguishable. IDENTIFIER should probably be something like this: IDENTIFIER: [a-zA-Z][a-zA-Z0-9_\-]*; - Dietmar Höhmann
@DietmarHöhmann thanks! I took your idea and made it suit my needs by moving the INT definition before IDENTIFIER and making instruction_name: INT | IDENTIFIER; which seems to work for me now. I forgot to mention the requirement that instruction_name must also be allowed to be an integer. If you'd like to post your comment as an answer, I'll accept it, as it does answer the question I originally asked. - Kirby Gaming

1 Answer

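The problem is that 123 is recognised as IDENTIFIER, because it is a valid identifier (all INTs are). When two lexer rules match the same text with the same length, ANTLR picks the rule that is defined first, and in your grammar IDENTIFIER comes before INT, so the lexer never produces an INT token. The two rules must be distinguishable. IDENTIFIER should probably be something like this: IDENTIFIER: [a-zA-Z][a-zA-Z0-9_\-]*;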
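To make the tie-breaking easier to see, here is a rough Python sketch that imitates how the lexer chooses a token: longest match wins, and on a tie the rule listed first wins. It is not ANTLR itself. The tokenize helper, the rule lists and the regex translations of the grammar's lexer rules are all made up for illustration, and the COMMENT rule is left out for brevity.

```python
import re

# Lexer rules in the same order as the question's grammar (regex translations
# are an approximation for illustration). The implicit ' ' token is listed
# first because ANTLR defines literals used in parser rules ahead of the
# named lexer rules.
ORIGINAL_RULES = [
    ("' '", r" "),
    ("NEWLINE", r"[\r\n]+"),
    ("IDENTIFIER", r"[a-zA-Z0-9_\-]+"),
    ("INT", r"-?[0-9]+"),
    ("QSTRING", r'"(?:[^"\\\r\n]|\\["\\])*"'),
]

# Fix from the answer: IDENTIFIER must start with a letter.
FIXED_RULES = [
    (name, r"[a-zA-Z][a-zA-Z0-9_\-]*" if name == "IDENTIFIER" else pattern)
    for name, pattern in ORIGINAL_RULES
]

# Variant from the comments: IDENTIFIER unchanged, but INT is listed before it.
REORDERED_RULES = [ORIGINAL_RULES[i] for i in (0, 1, 3, 2, 4)]


def tokenize(text, rules):
    """Return (rule_name, lexeme) pairs: longest match, first rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


line = 'bbb 123 "str" 456'
print([name for name, _ in tokenize(line, ORIGINAL_RULES)])
print([name for name, _ in tokenize(line, FIXED_RULES)])
print([name for name, _ in tokenize(line, REORDERED_RULES)])
```

With the original rule order, both integers in the sample line come out as IDENTIFIER, which is why the parser reports them as extraneous input. With either the letter-first IDENTIFIER or INT listed before IDENTIFIER, they come out as INT. The reordered variant still lets an opcode be an integer, but only if the parser rule accepts INT there, as in instruction_name: INT | IDENTIFIER; from the comments.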