Find multiple Objective-C comments per file, in certain format, with Ruby Regex

Question

I'm writing a Ruby script that uses regex to find all comments of a specific format in Objective-C source code files.

The format is

/* <Headline_in_caps> <#>:
    <Comment body>
**/

I want to capture the headline in caps, the number and the body of the comment.

With the regex below I can find one comment in this format within a larger body of text.

My problem is that if there is more than one comment in the file, I end up with all the text, including code, between the first /* and the last **/. I don't want it to capture everything in between, only what is within each /* and **/ pair.

The body of the comment can include any characters except **/ and */, which both signify the end of a comment. Am I correct in assuming that a regex can find multiple whole-pattern matches while processing the text only once?

\/\*\s*([A-Z]+). (\d)\:([\w\d\D\W]+)\*{2}\//x

Broken apart the regex does this:

\/\* —finds the start of a comment

\s* —finds whitespace

([A-Z]+) —captures caps word

.<space> —find the space in between caps word and digit

(\d) —capture the digit

\: —find the colon

([\w\W\d\D]+) —captures the body of a message which can include all valid characters, except **/ or */

\*{2}\/ —finds the end of a comment

Here is a sample; everything from the first /* to the second **/ is captured:

/*

 HEADLINE 1:

 Comment body.

 **/

- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions
{
// This text and method declaration are captured
// The regex captures from HEADLINE to the end of the comment "meddled in." inclusively.

/*
       HEADLINE 2:

       Should be captured separately and without Objective-C code meddled in. 
 **/

}

Here is the sample on Rubular: http://rubular.com/r/4EoXXotzX0

I'm using gsub to apply the regex to a string containing the whole file, running Ruby 1.9.3. Another issue I have is that gsub gives me what Rubular ignores. Is this a regression, or is Rubular using a different method that gives what I want?

In the question "Regex matching multiple occurrences per file and per line", the answer is to use g for the global option, but that is not valid in Ruby regex.

Phrogz · Accepted Answer · 2012-01-20T21:06:53

Change this: ([\w\W\d\D]+)
To this: ([\w\W\d\D]+?)

This makes the quantifier non-greedy (lazy), so the match stops as soon as it sees the next closing **/. (Updated rubular: http://rubular.com/r/Whm31AJ6Kg)
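
As a rough sketch of how the lazy pattern might be applied to a whole file: String#scan returns one array of capture groups per match, which covers what the g flag does in other regex flavours. The sample text and the names COMMENT_RE, source and comments are made up for illustration, and the pattern is written without the x flag so that the literal space between headline and digit is kept.

```ruby
# Lazy version of the pattern from the question, with [\w\W] for the body.
# COMMENT_RE, source and comments are illustrative names, not from the question.
COMMENT_RE = %r{/\*\s*([A-Z]+) (\d):([\w\W]+?)\*\*/}

source = <<'OBJC'
/*

 HEADLINE 1:

 Comment body.

 **/

- (BOOL)application:(UIApplication *)application
{
/*
       HEADLINE 2:

       Should be captured separately.
 **/
}
OBJC

# With capture groups, scan yields [headline, number, body] for every match.
comments = source.scan(COMMENT_RE).map do |headline, number, body|
  { headline: headline, number: number.to_i, body: body.strip }
end

comments.each do |c|
  puts "#{c[:headline]} #{c[:number]}: #{c[:body]}"
end
```

Each comment comes back as its own entry, and the Objective-C code between the two comments is not part of either body.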

Also, note that [\w\W\d\D] matches absolutely any character, and can be written more simply as just [\w\W]. You could alternatively match the body with a negated class such as [^*\/]+, which also avoids the above problem of matching through the close, but only works if the body itself contains no * or / characters. (Updated rubular: http://rubular.com/r/2h0kGYkdVQ)
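
To illustrate the trade-off between the two body patterns, here is a small comparison, assuming two made-up comment strings (plain and slashy) and the hypothetical names NEGATED and LAZY. The negated class stops at any * or /, so it rejects a body that mentions a file path, while the lazy [\w\W]+? version still matches it.

```ruby
# Two variants of the pattern; only the body group differs.
# NEGATED, LAZY, plain and slashy are illustrative names and data.
NEGATED = %r{/\*\s*([A-Z]+) (\d):([^*/]+)\*\*/}
LAZY    = %r{/\*\s*([A-Z]+) (\d):([\w\W]+?)\*\*/}

plain  = "/* NOTE 1:\n  Plain body.\n**/"
slashy = "/* NOTE 2:\n  See lib/foo.rb for details.\n**/"

# Both patterns agree when the body has no '*' or '/'.
negated_plain = NEGATED.match(plain)
lazy_plain    = LAZY.match(plain)

# The slash in the body blocks the negated class, so there is no match at all.
negated_slashy = NEGATED.match(slashy)
lazy_slashy    = LAZY.match(slashy)

puts negated_plain[3].strip
puts negated_slashy.inspect
puts lazy_slashy[3].strip
```

If comment bodies may contain slashes or asterisks, the lazy form is probably the safer of the two.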
