I'm trying to write a Python regex that will validate a string composed of:
- Any unicode alphanumeric character (including combining characters)
- Any number of space characters
- Any number of underscores
- Any number of dashes
- Any number of periods
My test strings:
9 Melodía.de_la-montaña
9 Melodía.de_la-montaña
or, as string literals produced with ascii():
str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'
These look identical but aren't: one is normalized (precomposed) and the other uses combining characters for the accents.
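The difference between the two strings can be checked with the standard unicodedata module. This is only a small sketch; the variable names `combining_marks` and `nfc_equal` are made up for illustration.

```python
import unicodedata

str1 = '9 Melod\xeda.de_la-monta\xf1a'          # precomposed letters
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'    # base letters + combining marks

# The strings render the same but are different code point sequences.
print(str1 == str2)            # False
print(len(str1), len(str2))    # 23 25

# unicodedata.combining() returns a non-zero class for combining marks.
combining_marks = [c for c in str2 if unicodedata.combining(c)]
print([ascii(c) for c in combining_marks])   # ["'\\u0301'", "'\\u0303'"]

# NFC normalization composes base letter + mark into a single code point.
nfc_equal = unicodedata.normalize('NFC', str2) == str1
print(nfc_equal)               # True
```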
Here's my first stab:
import re
reg = re.compile("^[\w\.\- ]+$", re.IGNORECASE)
re.search(reg, str1) # None
re.search(reg, str2) # None
If I remove the positional anchors and use findall instead of search, I get lists like ['9 Melodi', 'a.de_la-montan', 'a'] or ['9 Melod', 'a.de_la-monta', 'a'].
I've even tried re.compile("^[\w\.\- ]+$", re.IGNORECASE | re.UNICODE), although that should be unnecessary in Python 3, right?
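A quick way to check both points, sketched with only the standard re module: a compiled str pattern already carries re.UNICODE in Python 3, and \w accepts a precomposed accented letter but not a bare combining mark, which would explain the split findall result for the decomposed string. The variable names here are illustrative only.

```python
import re

str2 = '9 Melodi\u0301a.de_la-montan\u0303a'    # decomposed form

# str patterns are compiled with re.UNICODE by default in Python 3.
unicode_is_default = bool(re.compile(r"\w").flags & re.UNICODE)

# \w matches the precomposed letter U+00ED ...
matches_precomposed = re.fullmatch(r"\w", "\xed") is not None
# ... but not the combining acute accent U+0301 on its own.
matches_combining = re.fullmatch(r"\w", "\u0301") is not None

print(unicode_is_default, matches_precomposed, matches_combining)  # True True False

# Without anchors, the match breaks at each combining mark.
pieces = re.findall(r"[\w.\- ]+", str2)
print(pieces)   # ['9 Melodi', 'a.de_la-montan', 'a']
```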
In searching for an answer I've found this question, and this one, and this one, and this one, but they are all old, deal with Python 2, and seem to suggest that the regex I wrote should work. The Python 3.5 regex docs mention that \w should match Unicode but offer no actual examples involving non-ASCII text.
How do I match the desired strings?
[...] unicodedata.normalize('NFC', somestr)? The questions you link to don't apply to your situation, not because they are in Python 2 (the regex engine is basically the same between 2 and 3, except re.UNICODE is now the default), but because they are not trying to match combined characters. – Martijn Pieters
[...] print(ascii(str1)) and print(ascii(str2)) versions of the strings too? That way we can trivially copy them without having to worry about using the right encodings. – Martijn Pieters
[...] regex library (slated to be moved into the Python stdlib eventually), as it gives you much more expressive power over what is included and what isn't. – Martijn Pieters
[...] ['9 Melodía.de_la-montaña']. \w matches the Latin-1 codepoints just fine. – Martijn Pieters
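Following the normalization suggestion in the comments, one possible approach is to normalize to NFC before matching. This is a minimal sketch, not a definitive answer: the helper name `is_valid` and the constant `PATTERN` are invented here, and it assumes every accented letter in the input has a precomposed form.

```python
import re
import unicodedata

# Word characters (which include the underscore), periods, dashes, and spaces.
PATTERN = re.compile(r"[\w.\- ]+")

def is_valid(text):
    """Return True if text, after NFC normalization, contains only allowed characters."""
    normalized = unicodedata.normalize("NFC", text)
    # fullmatch() requires the whole string to match, so no ^...$ anchors are needed.
    return PATTERN.fullmatch(normalized) is not None

str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'

print(is_valid(str1))    # True
print(is_valid(str2))    # True
print(is_valid("no!"))   # False
```

A combining mark that has no precomposed equivalent (for example 'q\u0301') survives NFC unchanged and is still rejected by this sketch; covering that case needs an explicit check for combining marks, for instance through the third-party regex library mentioned above.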