3
votes

I have the following string:

one two three four five six seven eight nine

And I am trying to construct a regular expression that groups the string into three groupings:

  1. Group 1: 'one two three'
  2. Group 2: 'four five six'
  3. Group 3: 'seven eight nine'

I have tried variations of (.*\b(one|two|three)?)(.*\b(four|five|six)?)(.*\b(seven|eight|nine)?), but this pattern captures the entire string in a single group instead of splitting it. The demo can be found here.
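To reproduce that behaviour outside the demo, here is a minimal JavaScript sketch (the variable names are just for illustration). It suggests why everything lands in one group: the first greedy `.*` swallows the whole string, and because every inner group is optional, nothing forces it to give any characters back.

```javascript
const input = "one two three four five six seven eight nine";

// The first attempt: every inner group is optional (note the trailing ?).
const firstAttempt =
  /(.*\b(one|two|three)?)(.*\b(four|five|six)?)(.*\b(seven|eight|nine)?)/;

const m = firstAttempt.exec(input);

console.log(JSON.stringify(m[1])); // "one two three four five six seven eight nine"
console.log(m[2]);                 // undefined (the optional inner group was skipped)
console.log(JSON.stringify(m[3])); // ""
console.log(JSON.stringify(m[5])); // ""
```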

Trying (.*\b(one|two|three))(.*\b(four|five|six))(.*\b(seven|eight|nine)) seems to get me closer to what I want, but the match information panel shows that the pattern finds two matches, each containing six capture groups.

I am using alternation (|) because the groups can be of any length. For example, applying the pattern to the string two three four should identify two groups:

  1. Group 1: 'two'
  2. Group 2: 'three four'.

3 Answers

2
votes

A large regex that probably does it

(?=.*\b(?:one|two|three|four|five|six|seven|eight|nine)\b)(\b(?:one|two|three)(?:\s+(?:one|two|three))*\b)?.+?(\b(?:four|five|six)(?:\s+(?:four|five|six))*\b)?.+?(\b(?:seven|eight|nine)(?:\s+(?:seven|eight|nine))*\b)?

https://regex101.com/r/rUtkyU/1

Readable version

 (?=
      .* \b 
      (?:
           one
        |  two
        |  three
        |  four
        |  five
        |  six
        |  seven
        |  eight
        |  nine
      )
      \b 
 )
 (                             # (1 start)
      \b   
      (?: one | two | three )

      (?:
           \s+ 
           (?: one | two | three )
      )*
      \b 
 )?                            # (1 end)

 .+? 
 (                             # (2 start)
      \b        
      (?: four | five | six )

      (?:
           \s+ 
           (?: four | five | six )
      )*
      \b     
 )?                            # (2 end)

 .+?   
 (                             # (3 start)
      \b          
      (?: seven | eight | nine )

      (?:
           \s+ 
           (?: seven | eight | nine )
      )*
      \b   
 )?                            # (3 end)
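
One thing to watch with the pattern above: each `.+?` needs at least one character, so on a short input such as two three four the second group can end up unmatched. Below is a rough JavaScript sketch of a variant that anchors the pattern and uses `\s*` between the groups instead. The helper names `run` and `splitIntoGroups` are made up for this example, and it assumes the input contains only the nine number words, in order.

```javascript
// Hypothetical helper: builds an optional capture group that matches one or
// more words from the same family, separated by whitespace.
function run(words) {
  const alt = `(?:${words.join("|")})`;
  return `(\\b${alt}(?:\\s+${alt})*\\b)?`;
}

// Anchored pattern: three optional runs separated by optional whitespace.
const pattern = new RegExp(
  "^\\s*" +
    run(["one", "two", "three"]) + "\\s*" +
    run(["four", "five", "six"]) + "\\s*" +
    run(["seven", "eight", "nine"]) + "\\s*$"
);

// Returns only the groups that actually matched; [] if the input does not fit.
function splitIntoGroups(input) {
  const m = pattern.exec(input);
  if (m === null) return [];
  return m.slice(1).filter((g) => g !== undefined);
}

console.log(splitIntoGroups("one two three four five six seven eight nine"));
// [ 'one two three', 'four five six', 'seven eight nine' ]
console.log(splitIntoGroups("two three four"));
// [ 'two three', 'four' ]
```

Because the pattern is anchored with ^ and $, any input containing other words returns an empty array rather than a partial match.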
1
votes

This answer assumes that you want to find groups of three number words at a time:

x <- c("one two three four five six seven eight nine")
regexp <- gregexpr("\\S+(?:\\s+\\S+){2}", x)
regmatches(x, regexp)[[1]]

[1] "one two three"    "four five six"    "seven eight nine"
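
For anyone not using R, the same pattern can be tried in JavaScript. This is only a sketch of the equivalent call (the variable names are mine), using `String.prototype.match` with the global flag:

```javascript
const x = "one two three four five six seven eight nine";

// \S+ is one word; (?:\s+\S+){2} adds exactly two more, giving three per match.
// match() returns null when nothing matches, hence the fallback to [].
const triples = x.match(/\S+(?:\s+\S+){2}/g) ?? [];

console.log(triples);
// [ 'one two three', 'four five six', 'seven eight nine' ]
```

Note that trailing words that do not make up a full group of three are silently dropped by this pattern.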

If you want a more general solution, which doesn't require knowing a priori what the length of the input is (i.e. how many groups of three are present), then you might have to use an iterative approach:

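parts <- strsplit(x, " ")[[1]]
output <- character(0)
# Walk through the words three at a time
for (i in seq(from=1, to=length(parts), by=3)) {
    # min() stops a short final chunk from being padded with NA
    last <- min(i + 2, length(parts))
    output <- c(output, paste(parts[i:last], collapse=" "))
}
output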

[1] "one two three"    "four five six"    "seven eight nine"
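
The iterative idea carries over to other languages as well. Here is a small JavaScript sketch of it; `chunkWords` is a hypothetical helper name, and unlike the regex version it keeps a shorter final group instead of dropping it.

```javascript
// Hypothetical helper: splits text on whitespace and joins the words back
// together in chunks of `size` (the last chunk may be shorter).
function chunkWords(text, size = 3) {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  const chunks = [];
  for (let i = 0; i < words.length; i += size) {
    chunks.push(words.slice(i, i + size).join(" "));
  }
  return chunks;
}

console.log(chunkWords("one two three four five six seven eight nine"));
// [ 'one two three', 'four five six', 'seven eight nine' ]
console.log(chunkWords("one two three four"));
// [ 'one two three', 'four' ]
```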
0
votes

I'm not quite sure what your desired output is. However, this expression matches the sample string and wraps each part in its own capturing group, so each one is easy to reference:

((one|two|three)\s.*?)((four|five|six)\s.*?)((seven|eight|nine)\s.*)


RegEx

If this is not the expression you wanted, you can modify it on regex101.com.

RegEx Circuit

You can also visualize your expressions on jex.im.

JavaScript Demo

This snippet shows what the various capturing groups return:

const regex = /((one|two|three)\s.*?)((four|five|six)\s.*?)((seven|eight|nine)\s.*)/gm;
const str = `one two three four five six seven eight nine

two three four six seven eight`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}