0
votes

Background

I'm working on a compiler for a latex-like language. I've already written the lex file and that's working the way it ought to, so far. However, I've been running into problems now that I'm working on the grammar in the .y file.

Problem

I've reproduced the part of the grammar that I believe is responsible for stymieing me:

%start document
%%
document: BEGINDOCUMENT documentbody ENDDOCUMENT;
documentbody: contentseq | ws MAKETITLE contentseq | MAKETITLE contentseq;
contentseq: | contentseq content;
content: STRING | ws;
ws: WHITESPACE;

Whitespace, in this context, is basically any mix of spaces, tabs, and newlines.

As I understand it from looking at the y.output file, the shift/reduce error is raised because of the rule

documentbody: ... | ws MAKETITLE contentseq | ...

Given the WHITESPACE token, Bison doesn't know whether this is part of the terminal "content" or if it will instead be followed by a MAKETITLE token. Both are perfectly valid input, and I'm unsure of how to fix this issue.

For clarity, a paraphrase of the original EBNF specification:

document: BEGINDOCUMENT [ws] [MAKETITLE] contentseq ENDDOCUMENT

In other words, both the ws terminal and MAKETITLE are optional.

Example input

BEGINDOCUMENT WHITESPACE MAKETITLE STRING ENDDOCUMENT
BEGINDOCUMENT WHITESPACE STRING ENDDOCUMENT
BEGINDOCUMENT MAKETITLE STRING ENDDOCUMENT
BEGINDOCUMENT STRING ENDDOCUMENT

All of the above should be accepted by the grammar.

What I've tried

I know many conflicts can be smoothed over by using precedence, but nothing I've tried in that vein has worked. I've tried assigning the MAKETITLE and WHITESPACE tokens every kind of precedence, but that hasn't solved the problem.

I've seen suggestions on other shift/reduce-related problems to rewrite the grammar to be less ambigous, but I'm not sure how to go about doing that - at least without changing what input the grammar accepts and does not accept.

One solution I have thought of, but not attempted, is messing with the lex file, but that seems a rather icky solution and I would rather find some way of doing it in yacc.

2
I don't see the point of having whitespace in the grammar at all. Just remove it. Let the lexer consume it.user207421

2 Answers

2
votes

The conflict basically results from the nullability of contentseq. That forces the parser to recognise an empty contentseq before it recognises a longer contentseq. And that produces a conflict when the input starts BEGINDOCUMENT WHITESPACE, because at the point before the WHITESPACE, it is not knowable whether that empty contentseq should be reduced.

You can easily resolve that by making contentseq non-nullable (contentseq: content | contentseq content), at the cost of having to explicitly handle omitted sequences:

documentbody: %empty | contentseq | maketitle optionalcs
contentseq: content | contentseq content
optionalcs: %empty | contentseq
maketitle: WHITESPACE MAKETITLE | MAKETITLE

This is a general problem with converting EBNF's optional syntax [ x ], particularly when x is repeated. You cannot always rely on being able to define optional-x; you often have to create two right-hand sides, one with x and the other without.


I don't see the point of ws: WHITESPACE; you could just use a WHITESPACE token instead of the ws non-terminal. If your grammar is more complicated than what you show, that non-terminal could be triggering a conflict but I don't see any ambiguity in what you have pasted. Nonetheless, in the sample solution above, I removed the redundant non-terminal.

0
votes

My personal preference is to avoid tricks specific to the tool, and define the grammar to more accurately describe what we want to recognize. I believe a grammar on this order recognizes the documents you want:

%start document
%token BEGINDOCUMENT ENDDOCUMENT MAKETITLE STRING WS
%%
document: BEGINDOCUMENT documentbody ENDDOCUMENT
        ;

documentbody: prefix title contents
            ;

prefix: 
        | WS
        ;

title: 
    | MAKETITLE
    ;

contents: 
        | STRING contentseq
        ;

contentseq: 
          | contentseq content
          ;

content: STRING 
        | WS
        ;

So, we start with an optional prefix of some white space. That's followed by an optional title. That's followed by the contents, which (since we've already recognized leading white space) is either empty, or a string followed by either strings or spaces.

Simple, straightforward, and easy for just about anybody to understand (assuming they recognize yacc notation at all, of course).