Ignore every token, but the ones in rules

Question

So, I'm trying to make a simple C parser in flex/bison. I only need to parse function and variable declarations, and their uses.

For example, using yyin = fopen(), I want to parse this .c file:

#include <stdio.h>
int main()
{
    int firstNumber, secondNumber, sumOfTwoNumbers;

    printf("Enter two integers: ");

    // Two integers entered by user is stored using scanf() function
    scanf("%d %d", &firstNumber, &secondNumber);

    // sum of two numbers in stored in variable sumOfTwoNumbers
    sumOfTwoNumbers = firstNumber + secondNumber;

    // Displays sum
    printf("%d + %d = %d", firstNumber, secondNumber, sumOfTwoNumbers);

    return 0;
}

The parser should detect

int firstNumber, secondNumber, sumOfTwoNumbers;

as a variable declaration, and count every use of the variables.

I can already do that. The thing is, I only need to parse those specific cases, not everything else such as the "=" and "+" tokens or // comments, although those can appear in the file.

I want a way to ignore every token except those that match a rule, so that yyparse doesn't call yyerror when it can't recognize a token. That way, when the file is parsed, I only perform an action for function/variable declarations, and everything else just runs smoothly until EOF.

There are two ancient web sources where I started (years ago), and it seems they are still alive: ANSI C grammar, Lex specification and ANSI C Yacc grammar. This is probably not enough for a "C11 standard compiler" but IMHO a good source to start your own experiments. (For my own, I stripped it a bit to get rid of K&R C artefacts which I really didn't intend to support.) — Scheff's Cat
Once I had a working parser, the hardest thing (for me) was to master these declarations. For this, A Retargetable C Compiler: Design and Implementation was really a help. — Scheff's Cat

alexsh · Accepted Answer · 2018-01-29T22:43:22

Your first requirement is relatively easy to satisfy: every variable declaration in C starts with a list of specifiers and qualifiers, so your rule may look something like:

var_declaration: spec_or_qual_list identifiers ';' { ... }

and a few rules such as

func:
  spec_or_qual_list identifier '(' well_nested_tokens ')' ';' {...}
| spec_or_qual_list identifier
  '(' well_nested_tokens ')' '{' body_tokens '}' {...}
| other_token well_nested_tokens ';' { ... process uses ... }

body_tokens:
  %empty                      {...}
| body_tokens var_declaration {...}
| body_tokens func            {...}

Note that the simple rules above would only parse 'flat' C files like the one you gave in your question (a list of functions and declarations, with no structure declarations or blocks other than the bodies of the functions).
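To make the idea behind well_nested_tokens more concrete, here is a minimal C sketch of what such a rule presumably has to do: skip over a run of tokens whose parentheses and braces balance, without looking at anything else inside. The token kinds and the name skip_well_nested are made up for illustration; in the bison grammar this job would be done by a recursive rule, not by hand-written code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; a real flex scanner would define its own. */
enum tok { T_LPAREN, T_RPAREN, T_LBRACE, T_RBRACE, T_SEMI, T_OTHER };

#define MAX_DEPTH 64

/*
 * toks[start] is expected to be '(' or '{'. Returns the index of the
 * matching closer, or -1 if the brackets are mismatched, nested deeper
 * than MAX_DEPTH, or the input ends first.
 */
long skip_well_nested(const enum tok *toks, size_t n, size_t start)
{
    enum tok stack[MAX_DEPTH];   /* brackets that are currently open */
    size_t depth = 0;

    assert(toks != NULL && start < n);
    for (size_t i = start; i < n; i++) {
        switch (toks[i]) {
        case T_LPAREN:
        case T_LBRACE:
            if (depth == MAX_DEPTH) return -1;
            stack[depth++] = toks[i];
            break;
        case T_RPAREN:
            if (depth == 0 || stack[--depth] != T_LPAREN) return -1;
            if (depth == 0) return (long)i;
            break;
        case T_RBRACE:
            if (depth == 0 || stack[--depth] != T_LBRACE) return -1;
            if (depth == 0) return (long)i;
            break;
        default:
            if (depth == 0) return -1;   /* toks[start] was not an opener */
            break;                       /* any other token is skipped */
        }
    }
    return -1;   /* ran out of tokens with brackets still open */
}
```

For the token sequence ( a ( b ) c ) ; a call starting at index 0 returns 6, the index of the last ')'. Everything in between is skipped without being examined, which is exactly why declarations hidden inside nested blocks would be missed.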

Once you decide to add blocks, you would have to add rules similar to the one for func above so that you can recursively descend into blocks and parse the declarations there. You cannot simply ignore them, since you would then miss the beginning of a declaration that may come after the block. If you let structures in, you have the much harder task of distinguishing between structure members and variables (or maybe you do not want to, in which case structures can be handled as blocks, syntactically only, as the scoping rules are very different).

Note that all these approaches ignore scoping rules, so you will not be able to track the association between a variable and its declaration (since a variable can be redeclared inside a block as something completely different). Finally, if you allow typedef names, all hell will break loose (look up 'lexer hack') and you might as well just write a full parser.
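For reference, the 'lexer hack' boils down to the scanner asking a table of known typedef names before deciding which token to return for an identifier. Below is a minimal sketch in C, assuming hypothetical token codes IDENTIFIER and TYPE_NAME and made-up helper names declare_typedef and classify_identifier. It ignores scope entirely, so it has the same limitation described above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token codes; bison would normally generate these. */
enum { IDENTIFIER = 258, TYPE_NAME = 259 };

#define MAX_TYPEDEFS 128
#define MAX_NAME_LEN 64

/* Flat table of typedef names seen so far (no scoping). */
static char typedef_names[MAX_TYPEDEFS][MAX_NAME_LEN];
static int typedef_count = 0;

/*
 * Meant to be called from the parser action that recognizes a typedef.
 * The name is copied, because the scanner's text buffer is reused.
 * Returns 0 on success, -1 if the table is full or the name is too long.
 */
int declare_typedef(const char *name)
{
    assert(name != NULL);
    if (typedef_count == MAX_TYPEDEFS || strlen(name) >= MAX_NAME_LEN)
        return -1;
    strcpy(typedef_names[typedef_count++], name);
    return 0;
}

/*
 * Meant to be called from the scanner's identifier rule: the token kind
 * depends on what the parser has declared so far.
 */
int classify_identifier(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TYPE_NAME;
    return IDENTIFIER;
}
```

The awkward part is the feedback loop: the scanner's answer for a name changes after the parser has processed a typedef, so the two can no longer be written independently of each other.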

The moral of all this is the following: if the files you are trying to parse have any kind of nontrivial structure, you will end up parsing a significant portion of the C grammar anyway. One other simplifying trick I can suggest is writing a (full) parser for C expressions and using it to parse C declarators (such as (* pfunc)(int)). This would not work for all C11 declarators, but it would handle most 'ordinary' ones. You still need to handle blocks, though.
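The expression-parser trick works because a C declarator is written the same way the declared name is later used. A small illustration (twice and call_through_pointer are made-up names):

```c
/* A plain function to point at. */
static int twice(int x)
{
    return 2 * x;
}

int call_through_pointer(int arg)
{
    int (*pfunc)(int) = twice;   /* declarator:  (*pfunc)(int) */
    return (*pfunc)(arg);        /* expression:  (*pfunc)(arg) */
}
```

Apart from the parameter list holding a type instead of a value, the declarator and the call have the same shape (parenthesized dereference followed by a parenthesized list), so a grammar that accepts the second already gets you most of the way to the first.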
