0 votes

I am using PLY (a popular Python implementation of Lex and Yacc) to create a simple compiler for a custom language.

Currently my lexer looks as follows:

reserved = {
    'begin': 'BEGIN',
    'end': 'END',
    'DECLARE': 'DECL',
    'IMPORT': 'IMP',
    'Dow': 'DOW',
    'Enddo': 'ENDW',
    'For': 'FOR',
    'FEnd': 'ENDF',
    'CASE': 'CASE',
    'WHEN': 'WHN',
    'Call': 'CALL',
    'THEN': 'THN',
    'ENDC': 'ENDC',
    'Object': 'OBJ',
    'Move': 'MOV',
    'INCLUDE': 'INC',
    'Dec': 'DEC',
    'Vibration': 'VIB',
    'Inclination': 'INCLI',
    'Temperature': 'TEMP',
    'Brightness': 'BRI',
    'Sound': 'SOU',
    'Time': 'TIM',
    'Procedure': 'PROC'
}

tokens = ["INT", "COM", "SEMI", "PARO", "PARC", "EQ", "NAME"] + list(reserved.values())

t_COM = r'//'
t_SEMI = r";"
t_PARO = r'\('
t_PARC = r'\)'
t_EQ = r'='
t_NAME = r'[a-z][a-zA-Z_&!0-9]{0,9}'

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Syntax error: Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

Per the documentation, I am creating a dictionary of reserved keywords and adding their token names to the tokens list, rather than writing an individual rule for each keyword. The documentation also states that matching precedence is decided by these two rules:

  1. All tokens defined by functions are added in the same order as they appear in the lexer file.
  2. Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
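
As a sanity check of rule 2, consider a toy lexer like the one below (EQEQ is a made-up token, used purely for illustration and not part of my real lexer): the longer string pattern is added first, so '==' lexes as a single EQEQ token rather than two EQ tokens.

import ply.lex as lex

tokens = ['EQ', 'EQEQ']  # EQEQ is hypothetical, only to illustrate rule 2

t_EQ = r'='
t_EQEQ = r'=='  # longer regex, so it is added to the master pattern first
t_ignore = ' '

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('== =')
for tok in lexer:
    print(tok)  # LexToken(EQEQ,'==',1,0) then LexToken(EQ,'=',1,3)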

The problem I'm having is that when I test the lexer using this test string

testInput = "// ; begin end DECLARE IMPORT Dow Enddo For FEnd CASE WHEN Call THEN ENDC (asdf) = Object Move INCLUDE Dec Vibration Inclination Temperature Brightness Sound Time Procedure 985568asdfLYBasdf ; Alol"

The lexer produces the following output:

LexToken(COM,'//',1,0)
LexToken(SEMI,';',1,2)
LexToken(NAME,'begin',1,3)
Syntax error: Illegal character ' '
LexToken(NAME,'end',1,9)
Syntax error: Illegal character ' '
Syntax error: Illegal character 'D'
Syntax error: Illegal character 'E'
Syntax error: Illegal character 'C'
Syntax error: Illegal character 'L'
Syntax error: Illegal character 'A'
Syntax error: Illegal character 'R'
Syntax error: Illegal character 'E'

(That's not the whole output, but it's enough to see what's happening.)

For some reason, the lexer matches NAME tokens before the keywords. Even after it's done matching NAME tokens, it doesn't recognize the DECLARE reserved keyword. I have also tried adding the reserved keywords alongside the other tokens, using regular expressions, but I get the same result (the documentation also advises against doing so).

Does anyone know how to fix this problem? I want the lexer to identify reserved keywords first and then tokenize the rest of the input.

Thanks!

EDIT:

I get the same result even when using the t_ID-style function shown in the documentation:

def t_NAME(t):
    r'[a-z][a-zA-Z_&!0-9]{0,9}'
    t.type = reserved.get(t.value,'NAME')
    return t

The documentation you linked has a t_ID function that actually uses the reserved map. You don't have that. You have t_NAME, but that's a regex and doesn't use reserved. - sepp2k

1 Answer

1 vote

The main problem here is that you are not ignoring whitespace; all the "Illegal character" errors are a consequence of that. Adding a t_ignore definition to your lexer will eliminate those errors.
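
For example, a single line like this (a minimal sketch; extend the set with whatever else your language should skip) tells the lexer to pass over spaces and tabs silently:

t_ignore = ' \t'  # characters listed here are skipped between tokens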

But the lexer still won't work as expected even after you fix the whitespace issue, because you're missing an important aspect of the documentation, which tells you how to actually use the reserved dictionary:

To handle reserved words, you should write a single rule to match an identifier and do a special name lookup in a function like this:

 reserved = {
    'if' : 'IF',
    'then' : 'THEN',
    'else' : 'ELSE',
    'while' : 'WHILE',
    ...
 }

 tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

 def t_ID(t):
     r'[a-zA-Z_][a-zA-Z_0-9]*'
     t.type = reserved.get(t.value,'ID')    # Check for reserved words
     return t

(In your case, it would be NAME and not ID.)

Ply knows nothing about the reserved dictionary, and it has no idea how you produce the token names enumerated in tokens. The only purpose of tokens is to tell Ply which symbols in the grammar are tokens and which are non-terminals. The mere fact that a word appears in tokens does not define a pattern for that token.
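
Putting the two fixes together, a minimal sketch of the change to your lexer (keeping your reserved dictionary and tokens list as they are, and dropping the t_NAME string rule in favor of the function) would be:

t_ignore = ' \t'  # skip whitespace between tokens

def t_NAME(t):
    r'[a-zA-Z][a-zA-Z_&!0-9]{0,9}'
    # Look the lexeme up in reserved; fall back to NAME for ordinary identifiers.
    t.type = reserved.get(t.value, 'NAME')
    return t

Note that I widened the first character class to [a-zA-Z]: your original pattern only matches words that begin with a lowercase letter, so uppercase keywords like DECLARE never reach the reserved lookup at all (which is why your edit produced the same result). If identifiers proper must begin with a lowercase letter, keep the lookup and enforce that restriction separately.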