can't get correct python regex with this string that contains unicode

Question

I have this string:

s = u'vitamin a min. 14,053 iu/kg   vitamin c 13,000iu/kg vitamin d max. 10,000\u03bc/kg copper 1mg/kg vitamin e mon 10.00iu/kg'

I want to break it apart so I get [label, label2, amount, units].

label is the name, i.e. vitamin c, and can contain unicode characters
label2 is min|max depending on the particular string.
amount is the numerical amount listed (can include commas or decimals)
units can include unicode characters

You can see some edge cases cropping up already:

Some ingredients don't contain a label2 = min or max (see copper and vitamin c). The regex grouping can be None in this case.
Some ingredients have label2 misspelled (see vitamin e, which uses mon)
There may be unicode in the units

Ideally, I would like a regex that can match against individual ingredients as well as a messy list (like I have provided).

I came up with:

import re
regex = re.compile(ur'([a-z 0-9]+)(min|max|mon)?[. ]+([0-9., ]+)((?=[%])|[a-z/]+|[^\W\d_]+/[^\W\d_]+)', re.UNICODE)


re.findall(regex, s)

# [(u'vitamin a min', u'', u'14,053 ', u'iu/kg'), (u'   vitamin c', u'', u'13,000', u'iu/kg'), (u' vitamin d max', u'', u'10,000', u'\u03bc/kg'), (u' copper', u'', u'1', u'mg/kg'), (u' vitamin e mon 10', u'', u'00', u'iu/kg')]

re.findall(regex, u'vitamin a min. 14,053 iu/kg')

# [(u'vitamin a min', u'', u'14,053 ', u'iu/kg')]

This matches nearly everything, but you can see some problems.

The label group is capturing min/max, so label2 matches nothing.
I don't like hardcoding (min|max|mon) because there could be a case where the word is misspelled to something else and that hardcoding won't catch it.

ekhumoro · Accepted Answer · 2014-12-04T19:02:32

You can deal with splitting the two labels by using non-greedy matching.
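To illustrate the difference, here is a minimal sketch written for Python 3 (where the ur prefix is not valid, so a plain raw string is used). The input is a simplified variation of the sample with the dot after min dropped, and the names greedy and lazy are only for illustration:

```python
import re

# Simplified variation of the sample data (no dot after "min").
text = "vitamin a min 14,053 iu/kg"

# Greedy label group: grabs as many words as it can, swallowing "min".
greedy = re.compile(r"((?:\w+\s+)+)((?:min|max)\.?)?\s*([0-9.,]+)")

# Non-greedy label group: takes as few words as possible, so the
# optional second group gets a chance to match "min".
lazy = re.compile(r"((?:\w+\s+)+?)((?:min|max)\.?)?\s*([0-9.,]+)")

print(greedy.match(text).groups())  # ('vitamin a min ', None, '14,053')
print(lazy.match(text).groups())    # ('vitamin a ', 'min', '14,053')
```

The only change between the two patterns is the ? after the first group's + quantifier.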

But there is no way to avoid hardcoding the second label. Even without the spelling variants, you would have to at least specify (min|max), since otherwise there would be no reason to separate the second label from the first label (which can have any number of words). So the best you can do is extend that list with whatever other variants you can find in the data (there probably aren't all that many).
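One way to keep that list manageable is to build the alternation from a plain Python list, so a new variant means adding one string. This is only a sketch in Python 3; QUALIFIERS, qualifier_alt and qualifier are hypothetical names, and the list holds just the spellings seen in the question:

```python
import re

# Hypothetical list of second-label spellings; extend it as new
# variants turn up in the data.
QUALIFIERS = ["min", "max", "mon"]

# re.escape guards against variants containing regex metacharacters.
qualifier_alt = "|".join(re.escape(q) for q in QUALIFIERS)

# The fragment on its own: one of the listed words plus an optional dot.
qualifier = re.compile(rf"(?:{qualifier_alt})\.?", re.I)

print(qualifier_alt)                         # min|max|mon
print(bool(qualifier.fullmatch("min.")))     # True
print(bool(qualifier.fullmatch("copper")))   # False
```

The same qualifier_alt string could be interpolated into a larger pattern in place of a hand-written (min|max|mon).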

Anyway, here's one possible solution that works with the example data you've provided:

>>> import re
>>> from pprint import pprint
>>> regex = re.compile(ur"""
...     ((?:\w+\s+)+?)((?:min|max|mon)\.?)?
...     ([0-9., ]+)(%|[^\W\d_]+/[^\W\d_]+)
...     """, re.X | re.I | re.U)
>>> pprint(regex.findall(s))
[(u'vitamin a ', u'min.', u' 14,053 ', u'iu/kg'),
 (u'vitamin c ', u'', u'13,000', u'iu/kg'),
 (u'vitamin d ', u'max.', u' 10,000', u'\u03bc/kg'),
 (u'copper ', u'', u'1', u'mg/kg'),
 (u'vitamin e ', u'mon', u' 10.00', u'iu/kg')]
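The session above is Python 2. As a rough sketch of how the same pattern might be used in Python 3, the version below drops the ur prefix and the re.U flag (str patterns are Unicode-aware by default there) and tidies each match into the [label, label2, amount, units] shape asked for in the question. The parse helper, the whitespace stripping and the removal of the trailing dot are assumptions of this sketch, not part of the original answer:

```python
import re

s = ('vitamin a min. 14,053 iu/kg   vitamin c 13,000iu/kg '
     'vitamin d max. 10,000\u03bc/kg copper 1mg/kg vitamin e mon 10.00iu/kg')

regex = re.compile(r"""
    ((?:\w+\s+)+?)             # label: as few words as possible
    ((?:min|max|mon)\.?)?      # optional second label, optional dot
    ([0-9., ]+)                # amount: digits, commas, dots, spaces
    (%|[^\W\d_]+/[^\W\d_]+)    # units: percent sign or letters/letters
    """, re.X | re.I)

def parse(text):
    """Hypothetical helper: return [label, label2, amount, units] rows.

    label2 is None when the ingredient has no min/max qualifier.
    """
    rows = []
    for m in regex.finditer(text):
        label, label2, amount, units = m.groups()
        rows.append([
            label.strip(),
            label2.rstrip('.') if label2 else None,
            amount.strip(),
            units,
        ])
    return rows

for row in parse(s):
    print(row)
```

Because finditer exposes unmatched optional groups as None (findall gives empty strings instead), vitamin c and copper come back with label2 set to None. The same function works on a single ingredient such as 'vitamin a min. 14,053 iu/kg'.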
