A simplest lex/yacc set of rules to extract function-call-like statements

Question

I have no practical experience in lex/yacc so my question may look naive but I could not figure out a reasonable solution with all information I found in the stackoverflow and internet. Suppose I need a parser for C/C++-like syntax but everything I need are function-call-like statements like foo(a), bar(1, 2), foobar("x", a, (b, c)), etc. I am not interested in validity of the code and expressions; I am ready to consider them as just a text/sequence of symbols. I need to recognize string literals, identifiers and expressions between commas and parentheses (just a sequence of symbols). Well, I need to drop comments and to recognize preprocessor directives this is outside the question scope.

I am not familiar enough with lex/yacc, but I am a software engineer with, let's say, some experience. In the past, I wrote such a parser in C++ without any 3rd-party helpers/tools. Not in five minutes, but I wouldn't say it was a big deal. Nevertheless, it's a piece of code to manage. So next time I need it, I thought that using lex/yacc can be a good idea. Definitely, a solution for such a primitive task should be even more primitive with tools specialized for grammars. Apparently, I spent more time (unsuccessfully) trying to get something our from lex/yacc than I would need to write the parser completely manually.

Let's say my lex produces identifiers, string_literals, ',', '(', ')' and symbols (all the rest). It removes comments and preprocessor stuff. So I would like to say in yacc something like

expression_element
  : IDENTIFIER
  | STRING_LITERAL
  | SYMBOL
  | list_expression
  ;

list_expression
  : '(' ')'
  | '(' expression ')'
  | '(' expression_list ')'
  ;

expression_list
  : expression ',' expression
  | expression_list ',' expression
  ;

expression
  : expression_element
  | IDENTIFIER list_expression
  { /* And this is what I really need. */ }
  | expression expression_element
  ;

Well, I believe this could be written i another way and maybe even simpler. As I said, I do not mind validity of expressions, I mind only their integrity regarding ',', '(' and ')', and simplicity. Now, the real problem I cannot resolve is: whatever I do, I cannot force them to distinguish between 'IDENTIFIER list_expression' (taking precedence) and 'IDENTIFIER NOT-list-expression' where the first is the function-call-like statement I need and the second is just IDENTIFIER on its own as a part of anything else (statement).Anything I tried leads to conflicts and following parsing errors only.

Is there anything simple I miss? Or I need to create a gory grammar for such a small staff? Or I just need another tool (recommendations?)? I would prefer to avoid writing parser by myself unless this is an only simple solution...

I think you are going to find this very difficult without writing a parser for the complete grammar. You will keep discovering edge cases forever that will require you to parse rather than just skip them: and/or the skipping will become just as complex as the parsing, or more so, and less reliable. — user207421
To @marquis-of-lorne - this does not look so after the given answer. This reply resolved the issue at once.Note that I did not mean "function calls", I meant "function-call-like" in the sense that real function calls,. function declarations, macro definition and everything alike are the same. — Mtm 3.14

rici rici · Accepted Answer · 2020-06-07T02:40:32

You can resolve the conflict in your existing solution by moving | IDENTIFIER list_expression to expression_element, and adding

%left IDENTIFIER
%left '('

to your prologue, in order to force IDENTIFIER list_expression to take precedence over concatenating an IDENTIFIER with a list_expression as part of an expression.

The use of %left in the precedence declarations is entirely arbitrary, since associativity here is irrelevant. Precedence declarations are only used to resolve ambiguity, and the parses for IDENTIFIER IDENTIFIER and ( ( are not ambiguous; in both cases, only one parse is possible. It's only with the case of IDENTIFIER ( that there is an ambiguity: that could be the start of the right-hand side IDENTIFIER list_expression, or it could be an expression_element consisting of an IDENTIFIER followed by an expression_element consisting of a list_expression. To force the first interpretation, we need to make sure the ambiguity is resolved in favour of shifting the (.

In the absence of a relevant precedence declaration, yacc/bison will resolve an ambiguity in favour of the shift action. But it also produces a warning. Since the precedence declaration produces this same default action, the generated parser will be the same with or without the modification; the only effect of the declaration is to suppress the warning message.

I don't know if that's really the answer to your problem, since there aren't enough specifics to know what problem you are actually attempting to solve. If the program code you're looking at is really C, you'll find that expressions of the form (function)(3) will not be identified as function calls, although they quite possibly are. (They might well be cast expressions though; it depends on the semantics of what's inside the parentheses.) Function calls of this form are used in order to avoid invoking a function-like macro of the same name, but they might also be used just to satisfy a particular syntactic aesthetic of the coder. Distinguishing between casts and function calls is not possible without a lot of parsing infrastructure (you need to parse typedef declarations at least up to the point of figuring out what the name of the alias is, for example).

A simplest lex/yacc set of rules to extract function-call-like statements

1 Answers