I'm trying to write a Python regex that validates a string composed of:

  • Any Unicode alphanumeric character (including combining characters)
  • Any number of space characters
  • Any number of underscores
  • Any number of dashes
  • Any number of periods

My test strings:

9 Melodía.de_la-montaña
9 Melodía.de_la-montaña

or as string literals produced with ascii():

str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'

These look identical but aren't: one is in composed (normalized) form, and the other uses combining characters for the accents.

Here's my first stab:

import re

reg = re.compile(r"^[\w\.\- ]+$", re.IGNORECASE)
re.search(reg, str1) # None
re.search(reg, str2) # None

If I remove the anchors (^ and $) and use findall instead of search, I get lists like ['9 Melodi', 'a.de_la-montan', 'a'] or ['9 Melod', 'a.de_la-monta', 'a'].

I've even tried re.compile(r"^[\w\.\- ]+$", re.IGNORECASE | re.UNICODE), although that should be unnecessary in Python 3, right?

In searching for an answer I found this question, this one, this one and this one, but they are all old, deal with Python 2, and seem to suggest that the regex I wrote should work. The Python 3.5 regex docs mention that \w should match Unicode but offer no actual examples involving non-ASCII text.

How do I match the desired strings?

1
Is normalising the string first an option? unicodedata.normalize('NFC', somestr)? The questions you link to don't apply to your situation, not because they are in Python 2 (the regex engine is basically the same between 2 and 3, except that re.UNICODE is now the default), but because they are not trying to match combining characters. - Martijn Pieters
Can you please include the print(ascii(str1)) and print(ascii(str2)) versions of the strings too? That way we can trivially copy them without having to worry about using the right encodings. - Martijn Pieters
@MartijnPieters I could, and my first test string is normalized, but the regex still isn't matching correctly. - Jared Smith
You probably want to switch to the regex library (slated to be moved into the Python stdlib eventually), as it gives you much more expressive power over what is included and what isn't. - Martijn Pieters
I can't reproduce the failure for the first case; I get ['9 Melodía.de_la-montaña']. \w matches the Latin-1 codepoints just fine. - Martijn Pieters

1 Answer


Your first sample, str1, matches just fine; \w includes all Unicode word characters, including Latin characters with accents.
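To see which characters the character class actually rejects, here is a small diagnostic sketch. The helper name rejected_chars is made up for illustration; it tests each character on its own against the same class and reports the code point and Unicode category of anything that fails:

```python
import re
import unicodedata

str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'

# Same character class as in the question, applied one character at a time.
allowed = re.compile(r"[\w.\- ]")

def rejected_chars(text):
    """Return (code point, Unicode category) for each character the class rejects."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch))
            for ch in text if not allowed.fullmatch(ch)]

print(rejected_chars(str1))  # []
print(rejected_chars(str2))  # [('U+0301', 'Mn'), ('U+0303', 'Mn')]
```

The only offenders in str2 are the two combining marks (category Mn, nonspacing mark), which \w does not match on their own.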

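Your second sample, str2, fails because \w does not match the combining marks U+0301 and U+0303 on their own. You can normalise such strings to the composed form with unicodedata.normalize(), using the NFC form: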

>>> import re
>>> import unicodedata
>>> str1 = '9 Melod\xeda.de_la-monta\xf1a'
>>> str2 = '9 Melodi\u0301a.de_la-montan\u0303a'
>>> reg = re.compile(r"^[\w\.\- ]+$")
>>> reg.search(str1)
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>
>>> reg.search(str2) is None
True
>>> reg.search(unicodedata.normalize('NFC', str2))
<_sre.SRE_Match object; span=(0, 23), match='9 Melodía.de_la-montaña'>
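
If you need this in more than one place, the two steps could be wrapped in a helper along these lines. This is only a sketch; is_valid_name is a hypothetical name, and it uses fullmatch() instead of the ^...$ anchors, since $ also matches just before a trailing newline:

```python
import re
import unicodedata

# \w already covers underscores; add period, dash and space.
_VALID = re.compile(r"[\w.\- ]+")

def is_valid_name(text):
    """Normalise to NFC, then require the whole string to match."""
    composed = unicodedata.normalize("NFC", text)
    return _VALID.fullmatch(composed) is not None

print(is_valid_name('9 Melod\xeda.de_la-monta\xf1a'))            # True
print(is_valid_name('9 Melodi\u0301a.de_la-montan\u0303a'))      # True
print(is_valid_name('bad!name'))                                 # False
```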

Note that the re.IGNORECASE flag is not needed; \w matches both upper- and lowercase letters.
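
One caveat: NFC only helps when a precomposed form exists. A sequence such as 'q\u0303' (q with a combining tilde) has no single code point, so it stays decomposed and the regex still rejects it. If you need to accept those too while staying in the standard library, one possible approach is a per-character check that also allows any character whose Unicode category starts with M (marks). The function name is_valid below is an assumption for illustration, and this sketch does not check that a mark actually follows a base character:

```python
import re
import unicodedata

# Characters the original class accepts, tested one at a time.
_BASE = re.compile(r"[\w.\- ]")

def is_valid(text):
    """Accept word characters, '.', '-', ' ' and any combining mark (Mn, Mc, Me)."""
    if not text:
        return False
    text = unicodedata.normalize("NFC", text)
    return all(
        _BASE.fullmatch(ch) is not None
        or unicodedata.category(ch).startswith("M")
        for ch in text
    )

print(is_valid('9 Melodi\u0301a.de_la-montan\u0303a'))  # True
print(is_valid('q\u0303'))                              # True
print(is_valid('a!b'))                                  # False
```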