3
votes

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.

In almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }. That is to say, in the first case a letter followed by zero or more alphanumeric characters, or in the second case one or more alphanumeric characters.

My input file is split into a list of word strings (e.g. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The input is split into lists of words rather than lists of letters so that keywords defined as terminals can be recognized.

How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?

Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?

I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.

1 Answer

4
votes

First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.

In Prolog, you can easily convert an atom to a list of characters with atom_chars/2.

For example:

?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].

Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.

For example:

?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.

The general pattern to express identifiers such as yours with DCGs can look as follows:

identifier -->
        [L],
        { letter(L) },
        identifier_rest.

identifier_rest --> [].
identifier_rest -->
        [I],
        { letter_or_digit(I) },
        identifier_rest.

You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
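As a rough sketch of how these pieces could fit together in SWI-Prolog, the following defines letter/1 and letter_or_digit/1 with char_type/2 and repeats the grammar above so that the snippet is self-contained. The names is_identifier/1, identifier_word//0 and alnum_identifier//0 are hypothetical helpers introduced only for illustration. The sketch uses the csymf and csym classes and excludes the underscore explicitly, which is an assumption about what should count as a letter in your languages; note that char_type/2 also accepts non-ASCII letters.

```prolog
% Character classes (assumption: underscore is NOT a letter).
% csymf = letter or underscore, csym = letter, digit or underscore.
letter(C)          :- char_type(C, csymf), C \== '_'.
letter_or_digit(C) :- char_type(C, csym),  C \== '_'.

% First EBNF form: letter { letter | digit }
identifier --> [L], { letter(L) }, identifier_rest.

identifier_rest --> [].
identifier_rest --> [I], { letter_or_digit(I) }, identifier_rest.

% Second EBNF form: ( letter | digit ) { letter | digit }
alnum_identifier --> [C], { letter_or_digit(C) }, identifier_rest.

% Hypothetical helper: split a single word into characters and
% check it against identifier//0.
is_identifier(Word) :-
    atomic(Word),
    atom_chars(Word, Cs),
    once(phrase(identifier, Cs)).

% Hypothetical word-level nonterminal, for grammars that run over
% token lists such as [someIdentifier1, =, 3].
identifier_word --> [Word], { is_identifier(Word) }.
```

With this, only the identifier rule needs to look inside a word; the rest of the grammar can keep working on the list of words, for example:

?- phrase(identifier_word, [someIdentifier1]).
true.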

Further, you can of course introduce an argument to relate such lists to atoms.
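One possible way to add such an argument is sketched below. The nonterminals identifier//1, identifier_rest//1 and identifier_atom//1 are illustrative names, not part of any library, and the character tests are inlined so that the snippet stands on its own. As an assumption, the underscore is not treated as a letter.

```prolog
% identifier(Cs): Cs is the list of characters making up the identifier.
identifier([L|Ls]) -->
        [L],
        { char_type(L, csymf), L \== '_' },   % letter, no underscore
        identifier_rest(Ls).

identifier_rest([]) --> [].
identifier_rest([I|Is]) -->
        [I],
        { char_type(I, csym), I \== '_' },    % letter or digit
        identifier_rest(Is).

% identifier_atom(Atom): relates the parsed characters to an atom.
identifier_atom(Atom) -->
        identifier(Cs),
        { atom_chars(Atom, Cs) }.
```

For example, on a list of characters:

?- once(phrase(identifier_atom(A), [a, b, '1'])).
A = ab1.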