Im using the PLY python package to build a front-end compiler for a subset of C, but Im still struggling to understand how the bottom up parser(specifically LALR) works. I wrote a simple program with the given grammar to see how things work:
expression -> vowel_list consonant_list
vowel_list -> VOWEL vowel_list
| VOWEL
consonant_list : CONSONANT consonant_list
| CONSONANT
And used the input string: 'EUAIOTNC'
As an action for the different grammar rules, I use a list where I append all the tokens which are seen by the parser. Basically, the token is appended to the list when a reduction happens. At the end of parsing when the start symbol expression is obtained, the list reads: ['O', 'I', 'A', 'U', 'E', 'C', 'N', 'T'].
import ply.lex as lex
import ply.yacc as yacc
tokens = ['VOWEL','CONSONANT']
t_VOWEL = r'[AEIOUaeiou]'
t_CONSONANT = r'[A-Za-z]'
def t_error(token):
print(f'Illegal character: {token.value} on line number {t.lineno}')
token.lexer.skip(1)
def t_whitespace(t):
r'\s+'
pass
lexer = lex.lex()
tokens_generated = []
def p_expression(p):
'''
expression : vowel_list consonant_list
'''
print(p.stack)
print('Derivation complete!')
print(f'The tokens generated are: {tokens_generated}')
def p_vowel_list(p):
'''
vowel_list : VOWEL vowel_list
| VOWEL
'''
print(f'Derived the vowel {p[1]} !')
tokens_generated.append(p[1])
def p_consonant_list(p):
'''
consonant_list : CONSONANT consonant_list
| CONSONANT
'''
print(f'Derived the consonant {p[1]}!')
tokens_generated.append(p[1])
def p_error(p):
if p == None:
token = "end of file"
else:
token = f"{p.type}({p.value}) on line {p.lineno}"
input_string = 'EUAIOTNC'
parser = yacc.yacc(method='LALR',debug=True)
parser.parse(input_string)
The output I get is:
Derived the vowel O !
Derived the vowel I !
Derived the vowel A !
Derived the vowel U !
Derived the vowel E !
Derived the consonant C!
Derived the consonant N!
Derived the consonant T!
[$end]
Derivation complete!
The tokens generated are: ['O', 'I', 'A', 'U', 'E', 'C', 'N', 'T']
Im not able to figure out how this is happening. I had expected the list to contain the tokens in the same order as seen in the string. Can someone explain why Im getting what Im getting? How is the LALR parser parsing my input?