4
votes

One of my programs accepts a commands (like kill foo) at runtime. Think of it as a little domain-specific language. Here are a few examples:

kill
kill client
exit

But also, chained commands are allowed and whitespace is not significant before and after commands, so the following examples are also valid:

kill ; say "that was fun"
  kill  ;  kill      ; kill;

I have currently implemented this with lex/yacc (flex/bison to be specific) and that caused a lot of headache. The lexer very much depends on the context (for example whitespace tokens are generally not returned, unless after a kill keyword for example) and has many different states. The grammar used to have conflicts and I don’t really like the format in which it has to be specified (especially the $1, $2, $3, … to use arguments for non-terminals). Also, the error messages which bison provides (at parse-time) are sometimes accurate, but often not (the kill command with optional arguments leads to error messages like Unexpected $undefined, expected $end or ; for kill clont instead of kill client). Lastly, the C API for yacc is cruel (external defines all over the place).

I am not asking you to solve all the aforementioned questions (I will open separate threads with more specific descriptions and code if there is no way around lex/yacc). Instead, I am interested in alternatives to lex/yacc.

My criteria are the following:

  • Input is a string (const char *), there is no output but instead some code should be called for each different keyword.
  • I want to use this with C (C99).
  • The software should be already included in the major linux distros or at least easy to bundle / package.
  • It should be well-documented.
  • The syntax for describing my language should be easy.
  • It should output meaningful error messages upon parsing errors.
  • Performance is not that important (of course it should be fast, but the typical use case is interactive usage, not processing tons of MB of commands).
6
@Philip: Yeah, you kind of get the feeling I had after messing with all the lexer/parser stuff… :)Michael

6 Answers

3
votes

As for a very simple and small grammar, I'd consider writing the lexer/parser by hand - it's often not that much work.

Virtually all linux distros ship a variant of lex/yacc. Other than that, two other widely used parser generators are lemon and antlr.

3
votes

Since your language looks pretty simple, I suggest implementing a finite state machine that tokenizes and parses the input.

Simply read the input one character at a time, tokenizing at whitespace (while not in a quoted string). Each "command" takes the machine in a different state where it parses the command arguments. ";" or "\n" reset the machine to its start state.

1
votes

I like ANTLR a lot, having used it a couple times in production systems.

Something quirky about it is that in version 2 it supported generating C++ code but not C, whereas in version 3 it supports generating C code but not C++. I like C++ and so still use ANTLR v2, but you will probably enjoy v3. So much the better for you.

Many distros have ANTLR v2 packages, and some have v3 as well. It's fairly well documented (note that I use v2; I hope v3 is not worse in this regard).

ANTLR does not generate super awesome parsing error messages "out of the box." This seems to be something common to most general parser systems, and is fundamentally not an easy problem to solve. With some work, however, I've seen some decent diagnostic output coming from an ANTLR-based system (the application had some logic to help figure out what to say to the user--ANTLR doesn't have much magic here).

1
votes

An interesting alternative to Lex & Yacc is the Lemon parser. It has quite a lot going for it, but I have not used it in earnest, so I'm not completely sure how well it really works. It is used by SQLite.

1
votes

You may want to consider Ragel. I recently started using it and have found it a pleasure to work with once you get up to speed. In your example, you might do something like (note: not tested!):

#include <stdio.h>
#include <string.h>

%%{
    machine my_cmd_lang;

    action pk { printf("Killing %.*s\n", fpc-mark, mark); }
    action mk { mark = fpc; }

    k = 'kill'; # creates a machine that doesn't do anything
    x = 'exit' @{ printf("Exiting\n"); };
    arg = alpha+ >mk; # arg to kill is built in machine 'alpha' 1 or more times
    cmd = ((k space arg) @pk space* ';'?) | x;
    main := cmd* ;
}%%

%% write data;

int main(int argc, char* argv[]) {
    int cs;
    char* p = "kill client";
    char* pe = p + strlen(p);
    char* mark;

    %% write init;
    %% write exec;

    return 0;
}

Run it through Ragel with ragel <filename.rl> and it will spit out <filename.c>.

0
votes

You'd need a lexerless parser (e.g., an implementation of PEG). As you're using C and already familiar with yacc, something like this might be worth trying.

And if your grammar is simple enough, you can implement an ad hoc recursive descent parser instead.