31
votes

While matching an email address, after I match something like yasar@webmail, I want to capture one or more occurrences of (\.\w+). (What I am doing is a bit more complicated; this is just an example.) I tried adding (\.\w+)+, but it only captures the last match. For example, yasar@webmail.something.edu.tr matches, but the group only contains .tr after the yasar@webmail part, so I lose the .something and .edu groups. Can I do this with Python regular expressions, or would you suggest matching everything first and splitting the subpatterns afterwards?

4
Capturing repeated expressions was proposed in Python Issue 7132 but rejected. It is however supported by the third-party regex module. - Todd Owen
@ToddOwen But, isn't this now possible in 2.7? I don't know when it became possible. But, the answer from stackoverflow.com/a/9765037/3541976 seems to work just fine for me in 2.7 using the re module. - Michael Ohlrogge
@MichaelOhlrogge Issue 7132 is about what happens if the capturing parentheses are inside a repeat. The issue is not fixed, and will still only keep the last match. A possible workaround, as mentioned in the answer you linked to, is to put the capturing parentheses around a repeating pattern. (Note that (?: ...) are not capturing parentheses). - Todd Owen
@ToddOwen Got it, thank you, that is a helpful clarification! - Michael Ohlrogge

4 Answers

34
votes

The re module doesn't support repeated captures, but the third-party regex module does:

>>> import regex
>>> m = regex.match(r'([.\w]+)@((\w+)(\.\w+)+)', 'yasar@webmail.something.edu.tr')
>>> m.groups()
('yasar', 'webmail.something.edu.tr', 'webmail', '.tr')
>>> m.captures(4)
['.something', '.edu', '.tr']

In your case I'd split the repeated subpatterns afterwards. It leads to simple, readable code; see, for example, the code in @Li-aung Yip's answer.

14
votes

This will work:

>>> import re
>>> regexp = r"[\w\.]+@(\w+)(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?(\.\w+)?"
>>> email_address = "user@galactica.caprica.fleet.mil"  # "user" is a placeholder local part
>>> m = re.match(regexp, email_address)
>>> m.groups()
('galactica', '.caprica', '.fleet', '.mil', None, None)

But it's limited to a maximum of six subgroups. A better way to do this would be:

>>> m = re.match(r"[\w\.]+@(.+)", email_address)
>>> m.groups()
('galactica.caprica.fleet.mil',)
>>> m.group(1).split('.')
['galactica', 'caprica', 'fleet', 'mil']

Note that regexps are fine as long as the email addresses are simple, but many valid addresses will break this pattern. See this question for a detailed treatment of email address regexes.

8
votes

You can fix the problem of (\.\w+)+ only capturing the last match by doing this instead: ((?:\.\w+)+)
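A minimal sketch of how this looks with the standard re module, reusing the address from the question. The outer group keeps the whole repeated run as one string; the second findall pass that splits it into pieces is my addition rather than part of this answer, and the variable names are only illustrative.

```python
import re

address = "yasar@webmail.something.edu.tr"

# Capturing parentheses inside the repeat: only the last iteration survives.
inner = re.match(r"[.\w]+@\w+(\.\w+)+", address)
last_only = inner.group(1)

# Capturing parentheses around a non-capturing repeat: the whole run is kept.
outer = re.match(r"[.\w]+@\w+((?:\.\w+)+)", address)
whole_run = outer.group(1)

# Second pass over the captured run to recover the individual pieces.
parts = re.findall(r"\.\w+", whole_run)

print(last_only)
print(whole_run)
print(parts)
```

Note that the outer group still yields a single string, so some post-processing step is needed if you want each piece separately.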

5
votes

This is what you are looking for:

>>> import re

>>> s = "yasar@webmail.something.edu.tr"
>>> r = re.compile(r"\.\w+")
>>> m=r.findall(s)

>>> m
['.something', '.edu', '.tr']
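One caveat worth sketching: findall scans the whole string, so a dot in the local part would be picked up as well. The example below assumes a hypothetical address with a dotted local part and restricts the search to the text after the @. The helper name `domain_parts` is illustrative and not something from this answer.

```python
import re

def domain_parts(address):
    """Return the dotted pieces found after the last '@' in the address."""
    # rpartition splits on the last '@'; the local part is discarded.
    _, _, domain = address.rpartition("@")
    return re.findall(r"\.\w+", domain)

# Hypothetical address whose local part contains a dot.
s = "yasar.test@webmail.something.edu.tr"

whole_string = re.findall(r"\.\w+", s)  # also picks up '.test'
domain_only = domain_parts(s)           # only the pieces after '@'

print(whole_string)
print(domain_only)
```

If the input has no @ at all, `rpartition` leaves the whole string in `domain`, so the helper then behaves like the plain findall call.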