3 votes

Good evening, Stack Overflow. I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:

Grammar

Basically, the language consists of two base statements:

( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150

Now, I'm pretty sure about what I should do in order to interpret a statement:

1. Lexical analysis, to turn a statement into a sequence of tokens;
2. Syntax analysis, to build a symbol table (a HashMap with the variables and their values) and a syntax tree (needed to evaluate the GET statements);
3. An in-order visit of the tree, to compute the results I want.

I'd like some advice on how to read the source file for parsing. Considering that the parser should ignore any whitespace, tab or newline, is it possible to use a Java Pattern to match a whole statement at a time? Is there a good way to read a weirdly formatted (and possibly more complex) statement like this

(
  SET var

 25
 )

without confusing the parser with the open and closed parentheses?

For example

Scanner scan;           // scanner reading the source file
String pattern = "..."; // ideal pattern I've found to represent an expression
while (scan.hasNext(pattern))
    Interpreter.computeStatement(scan.next(pattern));

would it be a viable option for this problem?
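A single Pattern that matches one whole statement is hard to get right, because the parentheses can nest arbitrarily deep. What does work with a regex is matching one *token* at a time and letting the pattern skip whitespace, so the oddly formatted input above tokenizes exactly like the compact one. A minimal sketch (the class name, token alternatives and `tokenize` method are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    // One alternative per token kind: parentheses, keywords, numbers, variables.
    // Leading \s* consumes any whitespace, tabs and newlines between tokens.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:\\(|\\)|SET|GET|MUL|\\d+|[a-zA-Z]\\w*)");

    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        // Advance token by token; stop if something unrecognizable appears.
        while (pos < src.length() && m.find(pos) && m.start() == pos) {
            tokens.add(m.group().trim()); // trim the consumed whitespace
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The "weirdly formatted" statement from the question:
        System.out.println(tokenize("(\n  SET var\n\n 25\n )"));
        // [(, SET, var, 25, )]
    }
}
```

Grouping the flat token list back into statements (balancing the parentheses) is then the parser's job, not the lexer's.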

After you read the whole file content and store it in a string, you can do for (String part : fileContent.split("\\s+")) { /* ... */ }. Now part stores meaningful characters. Notice that you still have to perform lexical analysis on part, but it's easy. – Jason Hu

The default delimiter for Scanner is whitespace, so you're all set to go without adding any special regex at all. – markspace

@markspace I'm aware that Scanner uses whitespace as delimiter; I was wondering how to make the Scanner recognise only one statement at a time. Using Interpreter.computeStatement(scan.next()) would get only one token at a time (a parenthesis, a keyword, a number or a variable). – fedexist

The normal method is to lex (tokenize) the input file first, which is what I'm doing with Scanner. Then build a parse tree as you lex, which is what I assumed you wanted to do. Then walk the parse tree to interpret or compile your code. – markspace

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers. – Ira Baxter

2 Answers

1 vote

Solution proposed by Ira Baxter:

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.


Maybe this will help: stackoverflow.com/a/2336769/120163

1 vote

In the end, I understood thanks to Ira Baxter that this context-free grammar can't be parsed with regexps, and I used the concept of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.

Basically what I do is:

  • Parse every character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' is parsed as a VARIABLE);
  • Then I use the resulting token list to do the syntactic analysis, checking the patterns occurring inside the token list against the grammar;
  • If there's an expression inside a statement, I check recursively for any expression nested inside another expression, throwing an exception and skipping to the next correct statement if needed;
  • After analysing each single statement, I compute it as required by the specification.
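The recursive step in the list above is the heart of it: an operand is either a number, a variable, or another parenthesised expression, so the evaluator simply calls itself when it peeks an opening parenthesis. A hedged sketch of that idea (the class, `evalStatement`/`operand` names and the Deque-of-strings token representation are my illustration, not the linked source):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class Evaluator {
    private final Map<String, Integer> vars = new HashMap<>(); // symbol table

    // Evaluates one "( OP ... )" statement from the front of the token queue.
    public int evalStatement(Deque<String> toks) {
        expect(toks, "(");
        String op = toks.poll();
        int result;
        switch (op) {
            case "SET": {                       // ( SET name value )
                String name = toks.poll();
                result = operand(toks);
                vars.put(name, result);
                break;
            }
            case "GET":                         // ( GET expr )
                result = operand(toks);
                break;
            case "MUL":                         // ( MUL a b )
                result = operand(toks) * operand(toks);
                break;
            default:
                throw new IllegalStateException("unknown operator: " + op);
        }
        expect(toks, ")");
        return result;
    }

    // An operand is a nested expression, a number literal, or a variable.
    private int operand(Deque<String> toks) {
        if ("(".equals(toks.peek())) return evalStatement(toks); // recurse
        String t = toks.poll();
        return t.matches("\\d+") ? Integer.parseInt(t) : vars.get(t);
    }

    private void expect(Deque<String> toks, String s) {
        if (!s.equals(toks.poll()))
            throw new IllegalStateException("expected " + s);
    }

    public static void main(String[] args) {
        Evaluator ev = new Evaluator();
        ev.evalStatement(new ArrayDeque<>(
                Arrays.asList("(", "SET", "var", "25", ")")));
        int r = ev.evalStatement(new ArrayDeque<>(
                Arrays.asList("(", "GET", "(", "MUL", "var", "5", ")", ")")));
        System.out.println(r); // 125
    }
}
```

Evaluating directly while consuming tokens like this collapses steps 2 and 3 of the question into one pass; building an explicit tree first, as the answer describes, works the same way but stores the nodes instead of reducing them immediately.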