2
votes

I'm doing a pretty straightforward regex in Python and seeing some odd behavior when I use the "or" operator (|).

I am trying to parse the following:

>> str = "blah [in brackets] stuff"

so that it returns:

>> ['blah', 'in brackets', 'stuff']

To match the text between brackets, I am using a lookbehind and a lookahead, i.e.:

>> '(?<=\[).*?(?=\])'

If used alone this does indeed capture the text in brackets:

>> re.findall( '(?<=\[).*?(?=\])' , str )
>> ['in brackets']

But when I use the or operator to add a second alternative that parses the strings between spaces, the bracket match somehow breaks down:

>> [x for x in re.findall( '(?<=\[).*?(?=\])|.*?[, ]' , str ) if x!=' ' ]
>> ['blah ', '[in ', 'brackets] ']

For the life of me I can't understand this behavior. Any help would be appreciated.

Thanks!

4
This might help - regex101.com/r/xM7sK0/1 - on the left you can go into the debugger where it will explain how it matched the things it did. - TessellatingHeckler
Thanks, that is really useful. - FrancisWolcott
The problem is that the 2nd half of the regex also matches brackets. After the first match ("blah "), the remaining text is "[in brackets] stuff". The first half of the regex doesn't match here because the lookbehind doesn't find an opening bracket. So the 2nd half of the regex matches again and finds the text "[in ". - Aran-Fey
Ah I see. Thank you Rawing! - FrancisWolcott
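
One way to check the explanation in the comments is to print where each match starts and ends, using re.finditer. This is only a small sketch built on the pattern and string from the question; the variable names text, pattern and matches are my own.

```python
import re

text = "blah [in brackets] stuff"
pattern = r'(?<=\[).*?(?=\])|.*?[, ]'

# Collect (start, end, matched text) for every match, in scan order.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

for start, end, matched in matches:
    print(start, end, repr(matched))
# 0 5 'blah '
# 5 9 '[in '
# 9 19 'brackets] '
```

The lookbehind can only succeed at index 6, right after the "[". The second match has already consumed indices 5 through 8, so the scan resumes at index 9 and the first alternative never gets a chance to start there.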

4 Answers

2
votes

You can do:

>>> import re
>>> s = "blah [in brackets] stuff"

>>> re.findall(r'\b\w+\s*\w+\b', s)
['blah', 'in brackets', 'stuff']
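
Since this pattern never mentions the brackets, it may be worth checking how it behaves on other inputs before relying on it. The sketch below runs the same pattern on two extra strings that I made up for illustration; they are not from the question.

```python
import re

pattern = r'\b\w+\s*\w+\b'

# The input from the question: only the bracketed words are adjacent.
first = re.findall(pattern, "blah [in brackets] stuff")
print(first)   # ['blah', 'in brackets', 'stuff']

# Hypothetical input: two adjacent words outside brackets are merged too.
second = re.findall(pattern, "blah more [in brackets]")
print(second)  # ['blah more', 'in brackets']

# Hypothetical input: three bracketed words are split after the second.
third = re.findall(pattern, "[one two three]")
print(third)   # ['one two', 'three']
```

So the pattern pairs up any two words separated only by whitespace, which happens to give the desired result for the original string.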
2
votes

For those interested, this is the regex that I ended up going with. There is probably a more elegant solution somewhere, but this works:

>>> import re
>>> s = "blah 2.0 stuff 1 1 0 [in brackets] more stuff [1]"

>>> brackets_re = r'(?<=\[).*?(?=\])'  # text between [ and ]
>>> space_re = r'[-\.\w]+(?= )'  # a word, number or dash run followed by a space
>>> my_re = brackets_re + '|' + space_re

>>> re.findall(my_re, s)
['blah', '2.0', 'stuff', '1', '1', '0', 'in brackets', 'more', 'stuff', '1']
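
One thing to watch with space_re is that it requires a trailing space, so a final word that is not in brackets would be dropped. A possible alternative, offered only as a sketch and not as the approach used above, is to consume the brackets themselves and capture their contents in a group. The name token_re and the extra " end" appended to the sample string are my own additions for illustration.

```python
import re

s = "blah 2.0 stuff 1 1 0 [in brackets] more stuff [1] end"

# The combined pattern from the answer above, for comparison.
original = re.findall(r'(?<=\[).*?(?=\])|[-\.\w]+(?= )', s)

# Group 1 captures the contents of a bracket pair; group 2 captures any
# other run of characters that are neither whitespace nor brackets.
token_re = re.compile(r'\[([^\]]*)\]|([^\s\[\]]+)')

tokens = [m.group(1) if m.group(1) is not None else m.group(2)
          for m in token_re.finditer(s)]

print(original)  # the trailing 'end' is missing
print(tokens)    # the trailing 'end' is included
```

Because the first alternative matches the whole "[...]" span, the scan cannot start a plain-word match in the middle of a bracketed phrase.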
0
votes

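If you are looking for a simple pattern, \w+\w matches every run of two or more word characters. Note: I replaced str with string because str is the name of Python's built-in string type, and reusing it as a variable name shadows the built-in.

import re
string = "blah [in brackets] stuff"
f = re.findall(r'\w+\w', string)  # each run of two or more word characters
print(f)

Output: ['blah', 'in', 'brackets', 'stuff']

Note that this pattern ignores the brackets entirely, so 'in' and 'brackets' come back as separate items rather than as 'in brackets'.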

0
votes

The answers so far don't take into account that you may have more than two words inside the brackets, or only one. The following regex splits on the brackets, together with any whitespace immediately before an opening bracket or after a closing one. It also works if there is more than one bracketed section in the string.

s = "blah [in brackets] stuff"

s = re.split(r'\s*\[|\]\s*', s) # note the 'or' operator is used and literal opening and closing brackets '\[' and '\]'

print(s)

output: ['blah', 'in brackets', 'stuff']

And an example using a string with different amounts of words inside brackets and using several sets of brackets:

s = "blah [in brackets] stuff [three words here] more stuff [one-word] stuff [a digit 1!] stuff."

s = re.split(r'\s*\[|\]\s*', s)

print(s)

output: ['blah', 'in brackets', 'stuff', 'three words here', 'more stuff', 'one-word', 'stuff', 'a digit 1!', 'stuff.']
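
One caveat with re.split: when the string begins or ends with a bracket, the result contains empty strings at the edges. The sketch below uses a made-up input to show this and filters the empty pieces out; the names parts and tokens are my own.

```python
import re

# Hypothetical input that starts and ends with a bracketed section.
s = "[first] blah [in brackets]"

parts = re.split(r'\s*\[|\]\s*', s)
print(parts)   # ['', 'first', 'blah', 'in brackets', '']

# Drop the empty strings produced by separators at the edges.
tokens = [p for p in parts if p]
print(tokens)  # ['first', 'blah', 'in brackets']
```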