I'm trying to use Python 2.7's regular expression module to match all of the words in an NFKD-normalized Unicode string. My understanding is that the re.UNICODE flag adds Unicode support to the \w expression, but I am not having any success with it.
>>> import re
>>> s = u'ca\u0308t'
>>> print s
cät
>>> pattern = re.compile(ur'\w+', flags=re.UNICODE)
>>> pattern.findall(s)
[u'ca', u't']
I suppose this is because \u0308 is not considered alphanumeric, and so does not match \w. The pattern does match the NFKC-normalized string:
>>> s
u'ca\u0308t'
>>> import unicodedata
>>> r = unicodedata.normalize('NFKC', s)
>>> pattern.findall(r)
[u'c\xe4t']
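To check that supposition, here is a minimal sketch (the variable names are mine, for illustration only) that looks up the general category of each code point with unicodedata.category and repeats both matches. It is written to run on Python 2.7 and Python 3 alike, so it uses r'\w+' rather than a ur'' literal.

```python
import re
import unicodedata

s = u'ca\u0308t'  # NFKD form: 'a' followed by U+0308 COMBINING DIAERESIS

# General category of each code point; letters are 'Ll', and the
# combining diaeresis is 'Mn' (nonspacing mark), which is not alphanumeric.
categories = [unicodedata.category(ch) for ch in s]

word = re.compile(r'\w+', flags=re.UNICODE)

# Matching the decomposed string splits at the combining mark.
decomposed_matches = word.findall(s)

# After NFKC composition, 'a' + U+0308 becomes the single letter U+00E4.
composed_matches = word.findall(unicodedata.normalize('NFKC', s))
```

Here categories comes out as Ll, Ll, Mn, Ll, which is consistent with \w stopping at U+0308.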
It would be nice if using re.UNICODE would make the parser consider \u0061\u0308 equivalent to \u00E4. Is there anything I am doing wrong, or not aware of?
I will just use NFKC if nothing in the standard library can help. Thank you!
For information about Unicode normalization forms: http://unicode.org/reports/tr15/
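The NFKC fallback mentioned above could be sketched roughly like this. The helper name find_words_nfkd is hypothetical, and the approach assumes every base-plus-mark sequence in the input has a precomposed form, which does not hold for all of Unicode; sequences without one would still be split.

```python
import re
import unicodedata

WORD = re.compile(r'\w+', flags=re.UNICODE)


def find_words_nfkd(text):
    """Hypothetical helper: match words on the NFKC form of `text`,
    then hand each word back in NFKD form."""
    composed = unicodedata.normalize('NFKC', text)
    return [unicodedata.normalize('NFKD', w) for w in WORD.findall(composed)]


# 'cät and naïve' written with combining diaereses (NFKD).
words = find_words_nfkd(u'ca\u0308t and nai\u0308ve')
```

The round trip keeps the rest of the program working with NFKD strings while the match itself sees composed letters.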
Edit: I just found that this question has been asked before: Python regex \w doesn't match combining diacritics?
It looks like the best solution is to use the third-party regex module instead of re.
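If installing a third-party module is not an option, one standard-library approximation is to let each \w be followed by optional combining marks. This is only a sketch under a loud assumption: the character class below covers just the Combining Diacritical Marks block (U+0300 to U+036F), not every combining character, and the name WORD_WITH_MARKS is my own.

```python
import re

# Each word character may be followed by any number of marks from the
# Combining Diacritical Marks block (assumption: U+0300..U+036F is enough
# for the text at hand; other combining blocks are NOT covered).
WORD_WITH_MARKS = re.compile(u'(?:\\w[\u0300-\u036f]*)+', flags=re.UNICODE)

single = WORD_WITH_MARKS.findall(u'ca\u0308t')
several = WORD_WITH_MARKS.findall(u'ca\u0308t nai\u0308ve')
```

With this pattern the decomposed string is matched as one word without normalizing it first.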
[...] t in cat, but it should be appearing over the a, as in ä. While editing the post the accent appears over the a, there's something odd going on. – doykle
[...] \w in Unicode mode. Where U+063U+0308 is considered one character. – user557597
[...] flags=re.UNICODE? That might just return 1. Try it with just re.UNICODE – user557597
flags is the name of the parameter. I can use just re.UNICODE and have the same result. Regarding your first comment, that is the behavior I was expecting but not the behavior I observed. – doykle
[...] regex instead of re? – user557597