0
votes

Let's say I have a list of regexes like such (this is a simple example, the real code has more complex regexes):

regs = [r'apple', 'strawberry', r'pear', r'.*berry', r'fruit: [a-z]*']

I want to exactly match one of the regexes above (so ^regex$) and return the index. Additionally, I want to match the leftmost regex. So find('strawberry') should return 1 while find('blueberry') should return 3. I'm going to re-use the same set of regexes a lot, so precomputation is fine.

This is what I've coded, but it feels bad. The regex should be able to know which one got matched, and I feel this is terribly inefficient (keep in mind that the example above is simplified, and the real regexes are more complicated and in larger numbers):

import re

regs_compiled = [re.compile(reg) for reg in regs]
regs_combined = re.compile('^' +
                           '|'.join('(?:{})'.format(reg) for reg in regs) +
                           '$')

def find(s):
    if re.match(regs_combined, s):
        for i, reg in enumerate(regs_compiled):
            if re.match(reg, s):
                return i

    return -1

Is there a way to find out which subexpression(s) were used to match the regex without looping explicitly?

1

1 Answers

4
votes

The only way to figure out which subexpression of the regular expression matched the string would be to use capturing groups for every one and then check which group is not None. But this would require that no subexpression uses capturing groups on its own.

E.g.

>>> regs_combined = re.compile('^' +
                           '|'.join('({})'.format(reg) for reg in regs) +
                           '$')
>>> m = re.match(regs_combined, 'strawberry')
>>> m.groups()
(None, 'strawberry', None, None, None)
>>> m.lastindex - 1
1

Other than that, the standard regular expression implementation does not provide further information. You could of course build your own engine that exposes that information, but apart from your very special use case, it’s difficult to make this practically work in other situations—which is probably why this is not provided by existing solutions.