I have this string:
s = u'vitamin a min. 14,053 iu/kg vitamin c 13,000iu/kg vitamin d max. 10,000\u03bc/kg copper 1mg/kg vitamin e mon 10.00iu/kg'
I want to break it apart so I get [label, label2, amount, units].
- label is the name, i.e. vitamin c, and can contain unicode characters
- label2 is min|max depending on the particular string
- amount is the numerical amount listed (can include commas or decimals)
- units can include unicode characters
You can see some edge cases cropping up already:
- Some ingredients don't contain a label2 = min or max (see copper and vitamin c). The regex grouping can be None in this case.
- Some ingredients have label2 misspelled (see vitamin e, which uses mon)
- There may be unicode in the units
Ideally, I would like a regex that can match against individual ingredients as well as a messy list (like I have provided).
I came up with:
import re
regex = re.compile(ur'([a-z 0-9]+)(min|max|mon)?[. ]+([0-9., ]+)((?=[%])|[a-z/]+|[^\W\d_]+/[^\W\d_]+)', re.UNICODE)
re.findall(regex, s)
# [(u'vitamin a min', u'', u'14,053 ', u'iu/kg'), (u' vitamin c', u'', u'13,000', u'iu/kg'), (u' vitamin d max', u'', u'10,000', u'\u03bc/kg'), (u' copper', u'', u'1', u'mg/kg'), (u' vitamin e mon 10', u'', u'00', u'iu/kg')]
re.findall(regex, u'vitamin a min. 14,053 iu/kg')
# [(u'vitamin a min', u'', u'14,053 ', u'iu/kg')]
This matches nearly everything, but you can see some problems.
- The label is matching the min, max and label2 matches nothing.
- I don't like hardcoding (min|max|mon) because there could be a case where the word is misspelled to something else and that hardcoding won't catch it.
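One direction that might address both problems is sketched below. It makes the label lazy, so it stops before an optional qualifier, and it loosens the qualifier to "any three-letter word starting with m" instead of a fixed list. That qualifier rule is purely an assumption for illustration: it would wrongly treat a label ending in such a word as label2, and it would miss a misspelling that doesn't start with m. The names INGREDIENT and parse_ingredients are hypothetical, and the sketch is written for Python 3 (where str patterns are already Unicode-aware), unlike the Python 2 ur'' code above.

```python
import re

# Hypothetical pattern: a sketch, not a definitive solution.
INGREDIENT = re.compile(r'''
    (?P<label>[^\W\d_]\w*(?:\s+[^\W\d_]\w*)*?)  # words starting with a letter, as few as possible
    (?:\s+(?P<label2>m[^\W\d_]{2})\b\.?)?       # assumed qualifier: optional 3-letter word "m.."
    \s+
    (?P<amount>\d+(?:[.,]\d+)*)                 # digits with , or . separators
    \s*
    (?P<units>%|[^\W\d_]+(?:/[^\W\d_]+)?)       # % or letters, optionally followed by /letters
''', re.VERBOSE)


def parse_ingredients(text):
    """Return a list of (label, label2, amount, units) tuples.

    label2 is None when no qualifier was found.
    """
    return [
        (m.group('label'), m.group('label2'), m.group('amount'), m.group('units'))
        for m in INGREDIENT.finditer(text)
    ]


s = ('vitamin a min. 14,053 iu/kg vitamin c 13,000iu/kg '
     'vitamin d max. 10,000\u03bc/kg copper 1mg/kg vitamin e mon 10.00iu/kg')

for row in parse_ingredients(s):
    print(row)
```

Using finditer with named groups, rather than findall, is what lets a missing label2 come back as None instead of an empty string. The amount pattern only accepts a comma or period between digits, so the period after min/max is not pulled into the number, and 10.00 stays in one piece.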