Case insensitive regex for non-English characters

Question

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).

I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.

What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?

You do know that the correct uppercase version of übermäßig is ÜBERMÄSSIG, right? — Tim Pietzcker
Actually, no. I don't speak German. Wikipedia seemed to indicate that there is no uppercase letter for ß, but I guess I misunderstood. I just checked if /ß/i matches SS, and it doesn't. Do you know how I can accomplish this? — itzy
There is an uppercase "ß" in Unicode ("ẞ", U+1E9E), but the proper uppercasing of "ß" is "SS" (according to Unicode). In fact, uc(lc("ẞ")) returns SS — ikegami

ikegami · Accepted Answer · 2012-10-17T15:08:17

It works perfectly fine:

$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match

$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match

(The use utf8; tells Perl that the source code is encoded in UTF-8; without it, Perl would read those literals as raw bytes rather than characters.)

I suspect an encoding problem: you think you gave Perl "ß", but Perl actually received something else. It could also be that you're using an older version of Perl that can't handle multi-character folds correctly. Generally speaking, the /u modifier can help, but it shouldn't make a difference for this example.
