1
votes

Im using the PLY python package to build a front-end compiler for a subset of C, but Im still struggling to understand how the bottom up parser(specifically LALR) works. I wrote a simple program with the given grammar to see how things work:

expression -> vowel_list consonant_list

vowel_list -> VOWEL vowel_list
        | VOWEL

consonant_list : CONSONANT consonant_list
            | CONSONANT

And used the input string: 'EUAIOTNC'

As an action for the different grammar rules, I use a list where I append all the tokens which are seen by the parser. Basically, the token is appended to the list when a reduction happens. At the end of parsing when the start symbol expression is obtained, the list reads: ['O', 'I', 'A', 'U', 'E', 'C', 'N', 'T'].

import ply.lex as lex
import ply.yacc as yacc
tokens = ['VOWEL','CONSONANT']

t_VOWEL = r'[AEIOUaeiou]'
t_CONSONANT = r'[A-Za-z]'

def t_error(token):
    print(f'Illegal character: {token.value} on line number {t.lineno}')
    token.lexer.skip(1)

def t_whitespace(t):
    r'\s+'
    pass

lexer = lex.lex()

tokens_generated = []

def p_expression(p):
    '''
    expression : vowel_list consonant_list
    '''

    print(p.stack)
    print('Derivation complete!')
    print(f'The tokens generated are: {tokens_generated}')


def p_vowel_list(p):
    '''
    vowel_list : VOWEL vowel_list
                | VOWEL
    '''
    print(f'Derived the vowel {p[1]} !')
    tokens_generated.append(p[1])

def p_consonant_list(p):
    '''
    consonant_list : CONSONANT consonant_list
                    | CONSONANT
    '''
    print(f'Derived the consonant {p[1]}!')
    tokens_generated.append(p[1])


def p_error(p):
    if p == None:
        token = "end of file"
    else:
        token = f"{p.type}({p.value}) on line {p.lineno}"

input_string = 'EUAIOTNC'

parser = yacc.yacc(method='LALR',debug=True)
parser.parse(input_string)

The output I get is:

Derived the vowel O !
Derived the vowel I !
Derived the vowel A !
Derived the vowel U !
Derived the vowel E !
Derived the consonant C!
Derived the consonant N!
Derived the consonant T!
[$end]
Derivation complete!
The tokens generated are: ['O', 'I', 'A', 'U', 'E', 'C', 'N', 'T']

Im not able to figure out how this is happening. I had expected the list to contain the tokens in the same order as seen in the string. Can someone explain why Im getting what Im getting? How is the LALR parser parsing my input?

1

1 Answers

1
votes

It's parsing it like you told it to. When you say:

vowel_list: VOWEL vowel_list

what you are saying is:

  1. Recognise a VOWEL.

  2. Completely handle a vowel_list (which includes performing that vowel_list's reduction action).

  3. Finally, perform the associated reduction action for this vowel_list.

It should be evident why that results in printing the VOWELs right-to-left. The first VOWEL is printed in the outermost vowel_list's reduction action, which only is performed after all the inner vowel_lists' reduction actions.

If you had written the grammar in the normal way for bottom-up parsing, using left-recursion, the actions would have resulted in the print order you expected:

vowel_list: vowel_list VOWEL

In effect, that grammar means "to recognise a vowel_list, first recognise a shorter vowel_list and then add another VOWEL."

None of this really has to do with the LALR algorithm. It's inherent in the grammar you wrote. Any parser which executes actions when the production is complete would necessarily act the same. And that's really the only sensible time to execute the action, because normally every action depends on knowing the computation associated with its components.

What's important about the LR algorithms is that they permit parsing left-recursive grammars, allowing you to use the natural execution order. You could only do this with a right-recursive rule if you were to inject an action in the middle of the rule, or encapsulate every VOWEL in a unit production with its own separate action.