3
votes

I am curious about parsing C++ code using regular expressions. What I have so far (using Ruby) allows me to extract class declarations and their parent classes (if any):

/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{/

Here is an example in Rubular. Notice I can capture correctly the "declaration" and "inheritance" parts.
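
For readers without the Rubular link, here is a minimal sketch of what the pattern captures. Only the regexp comes from the question; the sample declaration and the names DECL_RE, sample, keyword, name and inheritance are made up for illustration.

```ruby
# Only the regexp is from the question; everything else is a hypothetical example.
DECL_RE = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{/

sample = "class Derived : public Base {\n  int x;\n};\n"

m = DECL_RE.match(sample)
keyword     = m[1]        # "class"
name        = m[2]        # "Derived"
inheritance = m[3].strip  # "public Base" (the raw capture keeps surrounding spaces)

puts "#{keyword} #{name} inherits: #{inheritance}"
```

Note that the third group is greedy up to the first opening brace, so it picks up surrounding whitespace, hence the strip.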

Where I am stuck is capturing the class body. If I use the following extension of the original regex:

/(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{[^}]*\};/

Then I can capture the class body only if it does not contain any curly braces, which rules out any nested class or function definition. At this point I have tried many things, but none of them improved on this. For instance, if I allow the body to contain braces, the regexp captures the first class declaration and then all the subsequent classes as if they were part of the first class's body!
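
Both symptoms can be reproduced in a few lines. BODY_RE is the pattern from the question; GREEDY_RE is only a guess at what a "body may contain braces" attempt could look like (the question does not show it), and the sample strings are invented.

```ruby
# Pattern from the question: the body may not contain "}".
BODY_RE = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{[^}]*\};/

flat   = "struct Point : Base { int x; int y; };"
nested = "struct Point : Base { int norm() { return x + y; } };"

flat_match   = BODY_RE.match(flat)    # matches the whole declaration
nested_match = BODY_RE.match(nested)  # nil: [^}]* stops at the inner "}", which is not followed by ";"

# Assumed reconstruction of the attempt that allows braces (not from the question).
GREEDY_RE = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\s*\{.*\};/m

two_classes  = "class A : X { void f() {} };\nclass B : Y { int z; };"
greedy_match = GREEDY_RE.match(two_classes)  # .* runs to the LAST "};", swallowing class B

puts flat_match[0]
puts nested_match.inspect
puts greedy_match[0] == two_classes
```

The greedy version has no notion of which closing brace pairs with which opening brace, so it simply extends to the last `};` in the input.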

What am I missing?

3 Answers

3
votes

Regular expressions are not the recommended way to parse code.

Most compilers and interpreters use lexers and parsers to convert code into an abstract syntax tree before compiling or running the code.

Ruby has a few lexer gems, like this one, that you can try incorporating into your project.
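
To give a feel for what a lexer does, here is a minimal sketch using StringScanner from Ruby's standard library rather than any particular gem. The tokenize method and its token names are hypothetical, and the token set is only enough to see class structure; it does not handle comments, string literals, or the rest of C++.

```ruby
require "strscan"

# Hypothetical token set: keywords, identifiers, braces, and "everything else".
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next                                   # whitespace carries no token
    elsif (text = scanner.scan(/struct\b|class\b/))
      tokens << [:keyword, text]
    elsif (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [:identifier, text]
    elsif (text = scanner.scan(/[{}]/))
      tokens << [text == "{" ? :open_brace : :close_brace, text]
    else
      tokens << [:other, scanner.getch]      # any other single character
    end
  end
  tokens
end

tokens = tokenize("class A : public B { int x; };")
tokens.each { |type, text| puts "#{type}: #{text}" }
```

A parser would then consume this token stream and build the tree, tracking brace nesting explicitly instead of asking a single pattern to do it.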

1
votes

A named capture group that calls itself recursively might help:

#                   named  v    recurse          v
/(struct|class)\s+(?<match>{((\g<match>|[^{}]*))*})/m

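Here \g<match> calls the named group recursively (it is a subexpression call, not a backreference), so we find the curly bracket that matches the one following the struct/class keyword. As written, the pattern expects the opening brace directly after the keyword, so you will probably want to tune the regexp, for example to allow a class name and a base list. I posted it in this form to make the solution as clear as possible.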
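
One possible tuning is sketched below. It is an assumption about what a tuned version could look like, not the answerer's pattern: the group names kind, name, bases and body, the constant CLASS_RE, and the sample source are all invented. It also assumes no braces appear inside comments or string literals.

```ruby
# Assumed tuning: class name and optional base list before the recursive brace group.
CLASS_RE = /
  (?<kind>struct|class)\s+
  (?<name>[^{:\s]+)\s*
  (?::\s*(?<bases>[^{]+?))?\s*
  (?<body>\{(?:[^{}]|\g<body>)*\})\s*;
/x

source = <<~CPP
  class A : public B {
    void f() { if (x) { g(); } }
  };
  struct C { int z; };
CPP

found = []
source.scan(CLASS_RE) { found << Regexp.last_match }  # keep the MatchData for named access

found.each do |m|
  puts "#{m[:kind]} #{m[:name]} bases=#{m[:bases].inspect} body_length=#{m[:body].length}"
end
```

Because the body group only ends at the brace that balances its opening one, the second class is matched separately instead of being absorbed into the first class's body.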

0
votes

What I can offer you is this:

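(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{([^{}]|\{\g<4>*\})*\};

Where \g<4> is a recursive application of the fourth capture group, which is ([^{}]|\{\g<4>*\}). The * after \g<4> lets a nested block contain any number of items; with a bare \g<4>, a nested pair of braces could hold only a single character or a single inner block.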
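
To see how repetition inside the nested braces affects matching, here is a small comparison. The sample input and the constant names SINGLE and REPEATED are made up; the two patterns differ only in whether \g<4> is followed by *.

```ruby
# SINGLE allows exactly one item inside a nested brace pair; REPEATED allows any number.
SINGLE   = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{([^{}]|\{\g<4>\})*\};/
REPEATED = /(struct|class)\s+([^{:\s]+)\s*[:]?([^{]+)\{([^{}]|\{\g<4>*\})*\};/

source = "class A : B { void f() { return; } };"

single_match   = SINGLE.match(source)    # nil: the inner block holds more than one character
repeated_match = REPEATED.match(source)  # matches the whole declaration

puts single_match.inspect
puts repeated_match[0]
puts repeated_match[2]                   # class name, "A"
```

The value left in group 4 after a recursive match is not very useful, so in practice the body is easier to read from the full match or from an extra group wrapped around the braces.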

Matching non-regular languages with regular expressions is never pretty. You might want to consider switching to a proper recursive descent parser, especially if you plan to do something with the stuff you just captured.
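
As a first step away from a single regexp, here is a sketch of a hand-written scanner that finds each class header with a simple pattern and then counts brace depth to locate the end of the body. This is a depth counter, which is much simpler than a recursive descent parser, and the helper name extract_classes is hypothetical. It assumes no braces occur inside comments, string literals, or character literals, and it reports only top-level classes.

```ruby
# Hypothetical helper: returns [name, body] pairs for each top-level struct/class.
def extract_classes(source)
  results = []
  pos = 0
  # [^{;]* refuses to cross ";" so forward declarations like "class Foo;" are skipped.
  header = /\b(?:struct|class)\s+([A-Za-z_]\w*)[^{;]*\{/
  while (m = header.match(source, pos))
    open_index = m.end(0) - 1            # index of the opening "{"
    depth = 0
    close_index = nil
    (open_index...source.length).each do |i|
      case source[i]
      when "{" then depth += 1
      when "}"
        depth -= 1
        if depth.zero?                   # this "}" balances the opening brace
          close_index = i
          break
        end
      end
    end
    break if close_index.nil?            # unbalanced input: stop rather than guess
    results << [m[1], source[(open_index + 1)...close_index]]
    pos = close_index + 1                # continue after this class body
  end
  results
end

source = "class Fwd;\nclass A : public B { void f() { if (x) { g(); } } };\nstruct C { int z; };"
classes = extract_classes(source)
classes.each { |name, body| puts "#{name}: #{body.strip}" }
```

A real parser would replace the character loop with a token stream and handle comments and literals, but the depth counter already avoids the "first class swallows the rest" problem.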