21
votes

I am trying to build a Lisp grammar. Easy, right? Apparently not.

I present these inputs and receive errors...

( 1 1)
23 23 23 
ui ui

This is the grammar...

%%
sexpr: atom                 {printf("matched sexpr\n");}
    | list
    ;
list: '(' members ')'       {printf("matched list\n");}
    | '('')'                {printf("matched empty list\n");}
    ;
members: sexpr              {printf("members 1\n");}
    | sexpr members         {printf("members 2\n");}
    ;
atom: ID                    {printf("ID\n");}
    | NUM                   {printf("NUM\n");}
    | STR                   {printf("STR\n");}
    ;
%%

As near as I can tell, I need a single non-terminal defined as a program, upon which the whole parse tree can hang. But I tried it and it didn't seem to work.

edit - this was my "top terminal" approach:

program: slist;

slist: slist sexpr | sexpr;

But it allows problems such as:

( 1 1 

Edit2: The FLEX code is...

%{
    #include <stdio.h>
    #include "a.yacc.tab.h"
    int linenumber;
    extern int yylval;
%}
%%
\n                         { linenumber++; }
[0-9]+                     { yylval = atoi(yytext); return NUM; }
\"[^\"\n]*\"               { return STR; }
[a-zA-Z][a-zA-Z0-9]*       { return ID; }
.
%%

An example of the over-matching...

(1 1 1)
NUM
matched sexpr
NUM
matched sexpr
NUM
matched sexpr
(1 1
NUM
matched sexpr
NUM
matched sexpr

What's the error here?

edit: The error was in the lexer.

7
What does your output look like when it parses the ( 1 1 ? I can't see how it gets to a point of not expecting the closing ).Beardo
Could you please post your lex/flex file too? Maybe there's an error. Also, you should not use characters '(' in the grammar if you use a lexer, I'm not really sure how these get along.jpalecek
What's strange is that even the valid list (1 1 1) doesn't come up as a match for a list. I would try two things, first make members left recursive: members: members sexpr | sexpr; Second, swap the order of list in sexpr: list | atom; See if that works.Beardo
Beardo, jpalecek had the answer - for this bug. :)Paul Nathan
One of your comments down the page suggests that you will be using this on an embedded device. If so, please mention this in your question.SingleNegationElimination

7 Answers

12
votes

Lisp grammar can not be represented as context-free grammar, and yacc can not parse all lisp code. It is because of lisp features such as read-evaluation and programmable reader. So, in order just to read an arbitrary lisp code, you need to have a full lisp running. This is not some obscure, non-used feature, but it is actually used. E.g., CL-INTERPOL, CL-SQL.

If the goal is to parse a subset of lisp, then the program text is a sequence of sexprs.

12
votes

The error is really in the lexer. Your parentheses end up as the last "." in the lexer, and don't show up as parentheses in the parser.

Add rules like

\)     { return RPAREN; }
\(     { return LPAREN; }

to the lexer and change all occurences of '(', ')' to LPAREN and RPAREN respectively in the parser. (also, you need to #define LPAREN and RPAREN where you define your token list)

Note: I'm not sure about the syntax, could be the backslashes are wrong.

4
votes

You are correct in that you need to define a non-terminal. That would be defined as a set of sexpr. I'm not sure of the YACC syntax for that. I'm partial to ANTLR for parser generators and the syntax would be:

program: sexpr*

Indicating 0 or more sexpr.

Update with YACC syntax:

program :  /* empty */
        | program sexpr
        ;

Not in YACC, but might be helpful anyway, here's a full grammar in ANTLR v3 that works for the cases you described(excludes strings in the lexer because it's not important for this example, also uses C# console output because that's what I tested it with):

program: (sexpr)*;

sexpr: list
    |  atom            {Console.WriteLine("matched sexpr");}
    ;

list:     
   '('')'              {Console.WriteLine("matched empty list");}
   | '(' members ')'   {Console.WriteLine("matched list");}

    ;

members: (sexpr)+      {Console.WriteLine("members 1");};

atom: Id               {Console.WriteLine("ID");}
    | Num              {Console.WriteLine("NUM");}
    ;


Num: ( '0' .. '9')+;
Id: ('a' .. 'z' | 'A' .. 'Z')+;
Whitespace : ( ' ' | '\r' '\n' | '\n' | '\t' ) {Skip();};

This won't work exactly as is in YACC because YACC generates and LALR parser while ANTLR is a modified recursive descent. There is a C/C++ output target for ANTLR if you wanted to go that way.

2
votes

Do you neccesarily need a yacc/bison parser? A "reads a subset of lisp syntax" reader isn't that hard to implement in C (start with a read_sexpr function, dispatch to a read_list when you see a '(', that in turn builds a list of contained sexprs until a ')' is seen; otherwise, call a read_atom that collects an atom and returns it when it can no longer read atom-constituent characters).

However, if you want to be able to read arbritary Common Lisp, you'll need to (at the worst) implement a Common Lisp, as CL can modify the reader run-time (and even switch between different read-tables run-time under program control; quite handy when you're wanting to load code written in another language or dialect of lisp).

1
votes

It's been a long time since I worked with YACC, but you do need a top-level non-terminal. Could you be more specific about "tried it" and "it didn't seem to work"? Or, for that matter, what the errors are?

I'd also suspect that YACC might be overkill for such a syntax-light language. Something simpler (like recursive descent) might work better.

1
votes

I just tried it, my "yacc lisp grammar" works fine :

%start exprs

exprs:
    | exprs expr
    /// if you prefer right recursion :
    /// | expr exprs
    ;

list:
    '(' exprs ')'
    ;

expr:
    atom
    | list
    ;

atom:
    IDENTIFIER
    | CONSTANT
    | NIL
    | '+'
    | '-'
    | '*'
    | '^'
    | '/'
    ;