Suppose I have the following string: anything/AAA/B/B/B/anything

and I want to match anything/, AAA/, B/B/B/, anything in 4 different groups.

AAA/ and B/B/B/ are optional and anything can be any string

so the result for the following string anything/AAA/B/B/B/anything will be
group1: anything/
group2: AAA/
group3: B/B/B/
group4: anything

and the result for the following string anything/anything will be
group1: anything/
group2: empty
group3: empty
group4: anything

I have tried the following regular expression: ^(.*?/)(AAA/)?(B/B/B/)?(.*?)$

The problem is that when the first anything contains a /, the optional groups are not captured

so the result for the following string any/thing/AAA/B/B/B/anything will be
group1: any/
group2: empty
group3: empty
group4: thing/AAA/B/B/B/anything

and I want it to be like this:
group1: any/thing
group2: AAA/
group3: B/B/B/
group4: anything

Any help will be appreciated

What would you want the groups to be if the input was any/thing/anything? - CAustin
I prefer this: group1: any/thing, group2: empty, group3: empty, group4: anything. But it can also be this: group1: any/, group2: empty, group3: empty, group4: thing/anything. - NirG

1 Answer

Your problem is that 'anything' can be, well, anything.

So, when you make the first group greedy, it matches everything up to and including 'AAA/' or 'B/B/B/', leaving nothing for the optional groups. But when you make it non-greedy (as in your example), the engine matches as little as possible for that first group and keeps that result if the rest of the pattern can still succeed. It can, because the optional groups are allowed to match nothing and the final 'anything', i.e. (.*?), can take the rest. That last group is lazy too, but it is anchored by $, so it has to expand until it reaches the end of the subject string; once it gets there without breaking the rules, the engine is done.

You might think that matching 'AAA/' or 'B/B/B/' to separate groups makes the final group even 'less greedy', but the regex engine won't walk all possible matches and give you the 'least greedy' one; it returns the first match it can find.
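To make the greedy and non-greedy behaviour described above concrete, here is a small sketch that runs the pattern from the question and a greedy variant of it on the problematic input. The variable names (subject, lazy, greedy) are only for illustration. Note that in Python an optional group that did not take part in the match is reported as None rather than as an empty string.

```python
import re

subject = 'any/thing/AAA/B/B/B/anything'

# Non-greedy first group (the pattern from the question): it stops at the
# first '/', the optional groups match nothing, and the last group is
# stretched to the end of the string.
lazy = re.search(r'^(.*?/)(AAA/)?(B/B/B/)?(.*?)$', subject)
print(lazy.groups())
# ('any/', None, None, 'thing/AAA/B/B/B/anything')

# Greedy first group: it runs up to the last '/', swallowing 'AAA/' and
# 'B/B/B/' before the optional groups get a chance to match them.
greedy = re.search(r'^(.*/)(AAA/)?(B/B/B/)?(.*?)$', subject)
print(greedy.groups())
# ('any/thing/AAA/B/B/B/', None, None, 'anything')
```

In both cases the engine stops at the first overall match it finds, which is why neither variant fills groups 2 and 3.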

So, I don't think you can get what you want in one go with the freedom of having 'anything' both in the front and the back - although I'd love to be proven wrong.
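That said, if the freedom at the back can be given up, a single pattern may be enough. The sketch below assumes that the final 'anything' never contains a '/', which the question does not state, so treat it as an option to check against your real data rather than as a general solution. The names SINGLE and split_path are hypothetical.

```python
import re

# Assumption (not stated in the question): the trailing part contains no '/'.
# With [^/]* as the last group, the lazy first group is forced to grow until
# only the optional parts and a slash-free tail remain.
SINGLE = re.compile(r'^(.*?/)(AAA/)?(B/B/B/)?([^/]*)$')


def split_path(s):
    """Return the four groups, using '' for optional parts that are absent."""
    match = SINGLE.match(s)
    if match is None:
        return None
    return tuple(group or '' for group in match.groups())


print(split_path('any/thing/AAA/B/B/B/anything'))
# ('any/thing/', 'AAA/', 'B/B/B/', 'anything')
print(split_path('any/thing/anything'))
# ('any/thing/', '', '', 'anything')
```

If the trailing part can contain slashes, this pattern gives different results (the first group absorbs everything up to the last '/'), and the ambiguity described above applies again.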

Depending on your language (example given in Python), you could just do a few matches in a row:

import re

# Patterns ordered from most to least specific. The empty groups () keep the
# group indices identical no matter which pattern ends up matching.
PATTERNS = [
    r'^(.*?/)(AAA/)(B/B/B/)(.*?)$',
    r'^(.*?/)(AAA/)()(.*?)$',
    r'^(.*?/)()(B/B/B/)(.*?)$',
    r'^(.*?/)()()(.*?)$',
]


def get_matches(s):
    # Return the first successful match, or None if no pattern matches
    # (for example, when the string contains no '/').
    for pattern in PATTERNS:
        match = re.search(pattern, s)
        if match:
            return match
    return None


print(get_matches('anything/AAA/B/B/B/anything').groups())
print(get_matches('anything/AAA/anything').groups())
print(get_matches('anything/B/B/B/anything').groups())
print(get_matches('anything/anything').groups())

Result:

('anything/', 'AAA/', 'B/B/B/', 'anything')
('anything/', 'AAA/', '', 'anything')
('anything/', '', 'B/B/B/', 'anything')
('anything/', '', '', 'anything')

That way the match always has the same parts at the same group indices, whichever pattern succeeded, but I don't really like the solution. If you state why you're trying to match this, I'm fairly sure there's a better way to achieve the goal than this.