
I'm trying to use Python 2.7's regular expression module to match all of the words in an NFKD normalized Unicode string. My understanding is that the re.UNICODE flag adds Unicode support to the \w expression, but I am not having any success with it.

>>> import re
>>> s = u'ca\u0308t'
>>> print s
cät
>>> pattern = re.compile(ur'\w+', flags=re.UNICODE)
>>> pattern.findall(s)
[u'ca', u't']

I suppose this is because \u0308 is not considered alphanumeric, and so does not match with \w. The pattern matches with NFKC normalized Unicode:

>>> s
u'ca\u0308t'
>>> import unicodedata
>>> r = unicodedata.normalize('NFKC', s)
>>> pattern.findall(r)
[u'c\xe4t']
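
The supposition about \u0308 can be checked with unicodedata. This is only a small sketch (the variable names are mine, and it is written to run on both Python 2.7 and Python 3): it looks up the general category of each code point in the string and asks whether U+0308 counts as alphanumeric, which appears to be the test that \w relies on.

```python
import unicodedata

s = u'ca\u0308t'

# General category of each code point; 'Ll' is a lowercase letter,
# 'Mn' is "Mark, nonspacing" (a combining mark).
categories = [(c, unicodedata.category(c)) for c in s]

# U+0308 is a mark rather than a letter or digit, so it is not alphanumeric.
is_alnum = u'\u0308'.isalnum()
```

If the category comes back as 'Mn' and is_alnum is False, that is consistent with \w+ stopping right before the combining diaeresis.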

It would be nice if using re.UNICODE would make the parser consider \u0061\u0308 equivalent to \u00E4. Is there anything I am doing wrong, or not aware of?

I will just use NFKC if nothing in the standard library can help. Thank you!
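
For what it's worth, a standard-library-only alternative that keeps the string in NFKD might look like the sketch below. The helper names is_word_char and find_words are hypothetical, not part of re or unicodedata. The idea is to treat any code point whose general category starts with 'M' (a mark) as a word character, alongside whatever str.isalnum() accepts and the underscore, and then collect maximal runs of such characters. One reason to consider this over NFKC is that some base-plus-mark sequences (for example q followed by U+0308) have no precomposed form, so composing them does not help \w.

```python
import itertools
import unicodedata


def is_word_char(ch):
    """True for alphanumerics, underscore, and Unicode marks (category M*)."""
    return (ch.isalnum()
            or ch == u'_'
            or unicodedata.category(ch).startswith('M'))


def find_words(text):
    """Return maximal runs of word characters, combining marks included."""
    return [u''.join(group)
            for is_word, group in itertools.groupby(text, key=is_word_char)
            if is_word]


words = find_words(u'ca\u0308t, dog.')
```

Note that this sketch would also accept a stray combining mark at the start of a run, which may or may not matter for your data.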

For information about Unicode normalization forms: http://unicode.org/reports/tr15/

Edit: I just found that this question has been asked before: Python regex \w doesn't match combining diacritics?

It looks like the best solution is to use the third-party regex module instead of re.

Side note for clarity: in my code example the accent appears over the t in cat, but it should appear over the a, as in ä. While I'm editing the post the accent appears over the a, so there's something odd going on. - doykle
ca U+0308 t is matched by standard \w in Unicode mode, where U+0061 U+0308 is considered one character. - user557597
Btw, why are you using an assignment as a parameter (flags=re.UNICODE)? That might just return 1. Try it with just re.UNICODE. - user557597
@sln In Python you can pass arguments by keyword explicitly; here flags is the name of the parameter. I can use just re.UNICODE and get the same result. Regarding your first comment, that is the behavior I was expecting, but not the behavior I observed. - doykle
I've tried this in Boost and it works in Unicode mode. Have you tried the replacement engine regex instead of re? - user557597

1 Answer


You can use this: \S+

\S matches any character except whitespace.

Example:

>>> re.compile(ur'\S+', flags=re.UNICODE).findall(u'ca\u0308t')
[u'ca\u0308t']
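
One caveat with \S+, shown as a small sketch (the sample text and variable names are made up for illustration): because \S matches anything that is not whitespace, punctuation next to a word ends up in the match as well, so it is not a drop-in replacement for \w+ on ordinary prose.

```python
import re

text = u'ca\u0308t, dog.'

# \S+ keeps the combining mark attached, but also keeps adjacent punctuation.
non_space_runs = re.findall(r'\S+', text, flags=re.UNICODE)

# \w+ drops the punctuation, but splits the word at the combining mark.
word_runs = re.findall(r'\w+', text, flags=re.UNICODE)
```

Here non_space_runs contains the comma and the full stop, while word_runs splits the first word in two, so which trade-off is acceptable depends on the input.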