3 votes

Good evening, Stack Overflow. I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:

Grammar

Basically, the language consists of two base statements:

( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) //Output: var2 = 150

Now, I'm pretty sure about what I should do in order to interpret a statement:

1. Lexical analysis, to turn a statement into a sequence of tokens;
2. Syntax analysis, to build a symbol table (a HashMap with the variables and their values) and a syntax tree (needed to evaluate the GET statements);
3. An in-order visit of the tree, to compute the results I want.

I'd like some advice on how to read the source file for parsing. Considering that the parser should ignore any whitespace, tab or newline, is it possible to use a Java Pattern to match a whole statement at a time? Is there a good way to read a weirdly formatted (and possibly more complex) statement like this

(
  SET var

 25
 )

without confusing the parser with the open and closed parentheses?

For example

Scanner scan;           // scanner reading the source file
String pattern = "..."; // ideal pattern I've found to represent an expression
while (scan.hasNext(pattern))
    Interpreter.computeStatement(scan.next(pattern));

would it be a viable option for this problem?
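A single Pattern that matches one whole statement is hard to get right, because the parentheses can nest arbitrarily deep. What does work with a regex is matching one *token* at a time and letting the pattern skip whitespace, so the oddly formatted input above tokenizes exactly like the compact one. A minimal sketch (the class name, token alternatives and `tokenize` method are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {
    // One alternative per token kind: parentheses, keywords, numbers, variables.
    // Leading \s* consumes any whitespace, tabs and newlines between tokens.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:\\(|\\)|SET|GET|MUL|\\d+|[a-zA-Z]\\w*)");

    public static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        // Advance token by token; stop if something unrecognizable appears.
        while (pos < src.length() && m.find(pos) && m.start() == pos) {
            tokens.add(m.group().trim()); // trim the consumed whitespace
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The "weirdly formatted" statement from the question:
        System.out.println(tokenize("(\n  SET var\n\n 25\n )"));
        // [(, SET, var, 25, )]
    }
}
```

Grouping the flat token list back into statements (balancing the parentheses) is then the parser's job, not the lexer's.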

After you read the whole file content and store it in a string, you can do for (String part : fileContent.split("\\s+")) { /* ... */ }. Now part stores meaningful characters. Notice that you still have to perform lexical analysis on part, but it's easy. – Jason Hu

The default delimiter for Scanner is whitespace, so you're all set to go without adding any special regex at all. – markspace

@markspace I'm aware that Scanner uses whitespace as delimiter; I was wondering how to make the Scanner recognise only one statement at a time. Using Interpreter.computeStatement(scan.next()) would get only one token at a time (a parenthesis, a keyword, a number or a variable). – fedexist

The normal method is to lex (tokenize) the input file first, which is what I'm doing with Scanner. Then build a parse tree as you lex, which is what I assumed you wanted to do. Then walk the parse tree to interpret or compile your code. – markspace

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers. – Ira Baxter

2 Answers

1 vote

Solution proposed by Ira Baxter:

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.


Maybe this will help: stackoverflow.com/a/2336769/120163

1 vote

In the end, I understood thanks to Ira Baxter that this context-free grammar can't be parsed with regexps, and I used the concept of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.

Basically what I do is:

  • Parse every character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' is parsed as a VARIABLE);
  • Then I use the resulting token list to do the syntactic analysis, checking the patterns occurring inside the token list against the grammar;
  • If there's an expression inside a statement, I check recursively for any expression nested inside another expression, throwing an exception and skipping to the next correct statement if needed;
  • After analysing each single statement, I compute it as required by the specification.
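The recursive step in the list above is the heart of it: an operand is either a number, a variable, or another parenthesised expression, so the evaluator simply calls itself when it peeks an opening parenthesis. A hedged sketch of that idea (the class, `evalStatement`/`operand` names and the Deque-of-strings token representation are my illustration, not the linked source):

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class Evaluator {
    private final Map<String, Integer> vars = new HashMap<>(); // symbol table

    // Evaluates one "( OP ... )" statement from the front of the token queue.
    public int evalStatement(Deque<String> toks) {
        expect(toks, "(");
        String op = toks.poll();
        int result;
        switch (op) {
            case "SET": {                       // ( SET name value )
                String name = toks.poll();
                result = operand(toks);
                vars.put(name, result);
                break;
            }
            case "GET":                         // ( GET expr )
                result = operand(toks);
                break;
            case "MUL":                         // ( MUL a b )
                result = operand(toks) * operand(toks);
                break;
            default:
                throw new IllegalStateException("unknown operator: " + op);
        }
        expect(toks, ")");
        return result;
    }

    // An operand is a nested expression, a number literal, or a variable.
    private int operand(Deque<String> toks) {
        if ("(".equals(toks.peek())) return evalStatement(toks); // recurse
        String t = toks.poll();
        return t.matches("\\d+") ? Integer.parseInt(t) : vars.get(t);
    }

    private void expect(Deque<String> toks, String s) {
        if (!s.equals(toks.poll()))
            throw new IllegalStateException("expected " + s);
    }

    public static void main(String[] args) {
        Evaluator ev = new Evaluator();
        ev.evalStatement(new ArrayDeque<>(
                Arrays.asList("(", "SET", "var", "25", ")")));
        int r = ev.evalStatement(new ArrayDeque<>(
                Arrays.asList("(", "GET", "(", "MUL", "var", "5", ")", ")")));
        System.out.println(r); // 125
    }
}
```

Evaluating directly while consuming tokens like this collapses steps 2 and 3 of the question into one pass; building an explicit tree first, as the answer describes, works the same way but stores the nodes instead of reducing them immediately.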