6
votes

I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.

If the user types this:

new test (test1 test2 test3) test "test5 test6"

I would like the resulting list to look like this:

["new", "test", "test1 test2 test3", "test", "test5 test6"]

In other words, if it is a single word separated by spaces, split it from the next word; if it is in parentheses, keep the whole group of words inside the parentheses as one item and remove the parentheses. The same goes for quotation marks.

I currently am using this code which does not meet the above standard (From the answers in the link above):

>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]', strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']

This works well, but there is a problem. If you have this:

strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"

It combines Hello and Test into a single item instead of two.

It also doesn't allow splitting on parentheses and quotation marks at the same time.
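
To make the first problem concrete, this is what the snippet above produces for that input. It is only a quick check of the behaviour described, not a fix:

```python
import re

strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"

# Splitting on parentheses only: "Hello Test " stays one chunk, and
# the join then glues its two words together with ", ".
result = [", ".join(x.split()) for x in re.split(r'[()]', strs) if x.strip()]
print(result)
# ['Hello, Test', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
```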

5
Have a look at greedy and non-greedy matching. - XORcist
@möter Do you have a link to a tutorial? Almost everything I find is a question about it that doesn't really help me, and I can't read the Python docs too well. If that's all that's left, it will have to do. - TrevorPeyton
Sorry, I misread the question. But here's a link to the official documentation: docs.python.org/2/library/re.html - XORcist

5 Answers

5
votes

The answer was simply:

re.findall(r'\[[^\]]*\]|\([^)]*\)|"[^"]*"|\S+', strs)
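
Note that this pattern returns the delimiters as part of each match, for example '(test1 test2 test3)' rather than 'test1 test2 test3'. A minimal sketch of one way to drop them, assuming groups are never nested and leaving out the square-bracket case; `TOKEN_RE` and `tokenize` are hypothetical names, not part of the answer:

```python
import re

# One alternative per token kind; each capture group holds the text
# without its surrounding delimiters.
TOKEN_RE = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def tokenize(text):
    """Split text on whitespace, keeping (...) and "..." groups whole."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        paren, quoted, word = match.groups()
        # Only one of the three groups participates in any given match.
        if paren is not None:
            tokens.append(paren)
        elif quoted is not None:
            tokens.append(quoted)
        else:
            tokens.append(word)
    return tokens

print(tokenize('new test (test1 test2 test3) test "test5 test6"'))
# ['new', 'test', 'test1 test2 test3', 'test', 'test5 test6']
```

An unclosed parenthesis or quote falls through to the `\S+` branch, so it ends up attached to an ordinary word instead of raising an error.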
2
votes

This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:

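from pyparsing import *
import string, re

# Example input taken from the question.
s = 'new test (test1 test2 test3) test "test5 test6"'

# A raw word is any run of printable characters except parentheses,
# double quotes and whitespace.
RawWord = Word(re.sub(r'[()"\s]', '', string.printable))
Token = Forward()
Token <<= ( RawWord |
            Group('"' + OneOrMore(RawWord) + '"') |
            Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)

print(Phrase.parseString(s, parseAll=True))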

This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
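
For readers who want to see what handling nesting involves without installing a library, here is a rough standard-library sketch. It is a stack-based scanner, not pyparsing and not the grammar above; `LEXER`, `parse_nested` and the nested-list output format are assumptions chosen for illustration:

```python
import re

# Tokens: a single parenthesis, a quoted string, or a bare word.
LEXER = re.compile(r'[()]|"[^"]*"|[^\s()"]+')

def parse_nested(text):
    """Return words as strings and parenthesized groups as nested lists."""
    stack = [[]]  # stack[-1] is the list currently being filled
    for tok in LEXER.findall(text):
        if tok == '(':
            stack.append([])             # start a new nested group
        elif tok == ')':
            if len(stack) == 1:
                raise ValueError('unbalanced ")"')
            group = stack.pop()
            stack[-1].append(group)      # attach finished group to its parent
        elif tok.startswith('"'):
            stack[-1].append(tok[1:-1])  # quoted text, quotes removed
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError('unbalanced "("')
    return stack[0]

print(parse_nested('new (a (b c)) "d e"'))
# ['new', ['a', ['b', 'c']], 'd e']
```

Unlike the flat regex approaches, this keeps track of depth, so unbalanced parentheses are reported instead of being silently mis-split.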

I realize you've long since solved your problem, but this is one of the highest Google-ranked pages for problems like this, and pyparsing is a little-known library.

1
votes

Your problem is not well defined.

Your description of the rules is

In other words if it is one word seperated by a space then split it from the next word, if it is in parentheses then split the whole group of words in the parentheses and remove them. Same goes for the commas.

I guess that by commas you mean inverted commas, i.e. quotation marks.

Then with this

strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"

you should get that

["Hello (Test1 test2) (Hello1 hello2) other_stuff"]

since everything is surrounded by inverted commas. Most probably, you want to ignore the outermost pair of inverted commas.

I propose this, although it is a bit ugly:

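import re, itertools
strs = input("enter a string list ")

# Split on the parenthesized group, then split each piece on the quoted group,
# flatten the result, and drop empty strings.
print([y for y in itertools.chain.from_iterable(
           re.split(r'"(.*)"', x) for x in re.split(r'\((.*)\)', strs))
       if y != ''])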

gets

>>> 
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
1
votes

This does what you expect:

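import re, itertools
strs = input("enter a string list ")

# Split on the parenthesized group, then on the quoted group; drop empty pieces.
res1 = [y for y in itertools.chain.from_iterable(
            re.split(r'"(.*)"', x) for x in re.split(r'\((.*)\)', strs))
        if y != '']

# Contents of the quoted and the parenthesized group.
# Both searches assume the input contains such a group.
set1 = re.search(r'"(.*)"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()

# Grouped pieces are kept whole and listed first; all other pieces
# are split on whitespace and appended after them.
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain.from_iterable(
            k.split() for k in res1 if k not in set1 and k not in set2)))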
0
votes

For Python 3.6 - 3.8

I had a similar question, but I didn't like any of those answers, maybe because most of them are from 2013. So I worked out my own solution.

import re

regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)

Here you are looking for three different kinds of token:

  1. Something enclosed in (); the parentheses must be escaped with backslashes in the pattern
  2. Something enclosed in ""
  3. Just words

The ? after + makes each match lazy instead of greedy, so a group ends at the first closing delimiter rather than the last.
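
To see what the lazy quantifier changes, here is a small comparison on the same test string. The variable names `greedy` and `lazy` are only for illustration:

```python
import re

test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'

# Greedy: .+ runs to the LAST ')' so both groups merge into one match.
greedy = re.findall(r'\(.+\)', test)
# Lazy: .+? stops at the FIRST ')' so each group is matched separately.
lazy = re.findall(r'\(.+?\)', test)

print(greedy)  # ['(Test1 test2) (Hello1 hello2)']
print(lazy)    # ['(Test1 test2)', '(Hello1 hello2)']
```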