JavaScript Regular Expressions and Capture Groups

0

votes

I'm new to regular expressions in JavaScript and am having trouble getting an array of matches from a text string like the one below:

Sentence would go here
-foo
-bar
Another sentence would go here
-baz
-bat

I would like to get an array of matches like this:

match[0] = [
    'foo',
    'bar'
]
match[1] = [
    'baz',
    'bat'
]

So to summarize, what I'm looking for is:

"any dash+word (-foo, -bar, etc) that comes AFTER a sentence"

Can anyone supply a formula that would capture all iterations instead of just the last one? Apparently a repeated capturing group only captures its last iteration. Forgive me if this is a stupid question. I'm using regex101 if anyone wants to send me some tests.

javascript regex

It may be easier to just iterate over all lines and collect the data as desired. – Felix Kling

Are the hyphens always at the start of the line? – Casimir et Hippolyte
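
The line-by-line approach suggested in the comments could look something like the sketch below. The function name groupDashWords is made up for illustration, and the code assumes that every dash word sits on its own line and that any other non-empty line counts as a sentence.

```javascript
// Hypothetical helper: groups "-word" lines under the sentence that precedes them.
function groupDashWords(text) {
  var groups = [];
  var current = null; // dash words collected for the sentence being processed
  text.split(/\r?\n/).forEach(function (line) {
    line = line.trim();
    if (line === '') return; // skip blank lines
    if (line.charAt(0) === '-') {
      // Assumption: a dash line that appears before any sentence is ignored.
      if (current) current.push(line.slice(1));
    } else {
      // Any other non-empty line starts a new group.
      current = [];
      groups.push(current);
    }
  });
  return groups;
}

var sample = 'Sentence would go here\n-foo\n-bar\nAnother sentence would go here\n-baz\n-bat';
console.log(JSON.stringify(groupDashWords(sample)));
// [["foo","bar"],["baz","bat"]]
```

A sentence with no dash words after it ends up as an empty array in the result; filter those out if they are not wanted.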

1

votes

If matching exactly two dash lines is sufficient, just match two lines that start with - and are preceded by a newline.

\n-(.*)\r?\n-(.*)

See the regex demo at regex101. To get the matches, use the exec() method.

var re = /\n-(.*)\r?\n-(.*)/g; var m;

var str = 'Sentence would go here\n-foo\n-bar\nAnother sentence would go here\n-baz\n-bat';

while ((m = re.exec(str)) !== null) {
  if (m.index === re.lastIndex) re.lastIndex++;
  document.write(m[1] + ',' + m[2] + '<br>');
}
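
If the number of dash lines per sentence varies, a pattern with two fixed groups will not cover every case. One possible variation, sketched here on the assumption that String.prototype.matchAll (ES2020) is available, matches each whole run of dash lines and then pulls the individual words out of it. The function name dashRuns is invented for this example.

```javascript
// Hypothetical helper: returns one array of words per run of "-" lines.
function dashRuns(str) {
  var result = [];
  // Each match is one uninterrupted run of lines that start with "-".
  for (var run of str.matchAll(/(?:\r?\n-.*)+/g)) {
    var words = [];
    // Within the run, capture the text after the dash on every line.
    for (var w of run[0].matchAll(/^-(.*)$/gm)) {
      words.push(w[1].trim());
    }
    result.push(words);
  }
  return result;
}

var text = 'Sentence would go here\n-foo\n-bar\nAnother sentence would go here\n-baz\n-bat\n-qux';
console.log(JSON.stringify(dashRuns(text)));
// [["foo","bar"],["baz","bat","qux"]]
```

Like the pattern above, this only sees dash lines that are preceded by a newline, so a dash line at the very start of the string is skipped.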

2

votes

The first regex I came up with is the following:

/([^-]+)(-\w+)/g

The first group ([^-]+) grabs everything that is not a dash. We then follow that up with the actual capture group we want, (-\w+). Then we add the g flag so that the regular expression object keeps track of the last place it looked. This means that each time we run regex.exec(search), we get the next match, just like the list of matches you see in regex101.
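
To make the bookkeeping done by the g flag visible, here is a small sketch on a made-up input string. It uses a simplified pattern rather than the one above, and only shows how lastIndex moves between calls to exec().

```javascript
var re = /-\w+/g;
var str = '-foo -bar';

var first = re.exec(str);      // matches "-foo"
var afterFirst = re.lastIndex; // 4, the position right after "-foo"
var second = re.exec(str);     // matches "-bar", searching from index 4
var afterSecond = re.lastIndex; // 9, the end of the string
var third = re.exec(str);      // null, nothing left to match
var afterThird = re.lastIndex; // 0, reset after the failed match

console.log(first[0], afterFirst, second[0], afterSecond, third, afterThird);
```

Because a failed match resets lastIndex to 0, the same regular expression object can be reused for another pass once a loop over exec() has run to completion.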

Note: The \w for JavaScript is equivalent to [a-zA-Z0-9_]. So, if you just want letters use this instead of \w: [a-zA-Z]
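
As a quick illustration of that difference, the sketch below runs both character classes over a made-up string that contains digits and an underscore.

```javascript
var input = '-foo_1 -bar2 -baz';

// \w also accepts digits and underscores.
var withWordChars = input.match(/-\w+/g);
// [a-zA-Z] stops at the first character that is not a letter.
var lettersOnly = input.match(/-[a-zA-Z]+/g);

console.log(withWordChars); // ["-foo_1", "-bar2", "-baz"]
console.log(lettersOnly);   // ["-foo", "-bar", "-baz"]
```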

Here is the code that implements this regular expression.

<p id = "input">
    Sentence would go here
    -foo
    -bar
    Another sentence would go here
    -baz
    -bat
</p>

<p id = "output">

</p>

<script>
    // Returns true if the text contains a word character, i.e. a sentence (not just whitespace) precedes the dash word.
    function check_for_word(search) {return search.split(/\w/).length > 1}
    function capture(regex, search) {
        var 
        // The initial match.
            match  = regex.exec(search),
        // Stores all of the results from the search.
            result = [],
        // Used to gather results.
            gather;
        while(match) {
            // Create something empty.
            gather = [];
            // Push onto the gather.
            gather.push(match[2]);
            // Get the next match.
            match = regex.exec(search);
            // While we have more dashes...
            while(match && !check_for_word(match[1])) {
                // Push result on!
                gather.push(match[2]);
                // Get the next match to be checked.
                match = regex.exec(search);
            };
            // Push what was gathered onto the result.
            result.push(gather);
        }
        // Hand back the result.
        return result;
    };
    var output = capture(/([^-]+)(-\w+)/g, document.getElementById("input").innerHTML);
    document.getElementById("output").innerHTML = JSON.stringify(output);
</script>

Using a slightly modified regular expression, you might get more of what you are looking for.

/[^-]+((?:-\w+[^-\w]*)+)/g

The extra bit of [^-\w]* allows for some sort of separation between each dash word. The non-capturing group (?:...) was added so that the + can repeat the dash word one or more times. We also do not need the () around [^-]+, because that data is no longer needed, as you will see below. The first version is more flexible as to what can break between dash words, but I find this one a lot cleaner.

function capture(regex, search) {
    var 
	// The initial match.
	    match  = regex.exec(search),
	// Stores all of the results from the search.
	    result = [],
	// Used to gather results.
		gather;
	while(match) {
	    // Create something empty.
	    gather = [];
		
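	    // Break up the large match into individual dash words.
	    var temp = match[1].split('-');
		for(var i = 0; i < temp.length; i++)
		{
		    // Strip anything that is not a word character (whitespace, punctuation).
		    temp[i] = temp[i].replace(/\W+/g, "");
			// Makes sure there was actually something to gather.
		    if(temp[i].length > 0)
		        gather.push("-" + temp[i]);
		}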
		
		// Push what was gathered onto the result.
		result.push(gather);
		
		// Get the next match.
		match = regex.exec(search);	
	};
	// Hand back the result.
	return result;
};
var output = capture(/[^-]+((?:-\w+[^-\w]*)+)/g, document.getElementById("input").innerHTML);
document.getElementById("output").innerHTML = JSON.stringify(output);

<p id = "input">
Sentence would go here
-foo
-bar
Another sentence would go here
-baz
-bat
My very own sentence!
-get
-all
-of
  -these!
</p>

<p id = "output">

</p>

1

votes

Regexp captures do not really work well with an unbounded number of groups. Splitting works better here:

var text = document.getElementById('text').textContent;
var blocks = text.split(/^(?!-)/m);
var result = blocks.map(function(block) {
  return block.split(/^-/m).slice(1).map(function(line) {
      return line.trim();
    });
});
document.getElementById('text').textContent = JSON.stringify(result);

<div id="text">Sentence would go here
-foo
-bar
Another sentence would go here
-baz
-bat
</div>
