2
votes

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).

I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.

What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?

You do know that the correct uppercase version of übermäßig is ÜBERMÄSSIG, right? – Tim Pietzcker
Actually, no. I don't speak German. Wikipedia seemed to indicate that there is no uppercase letter for ß, but I guess I misunderstood. I just checked if /ß/i matches SS, and it doesn't. Do you know how I can accomplish this? – itzy
There is an uppercase "ß" in Unicode ("ẞ", U+1E9E), but the proper uppercasing of "ß" is "SS" (according to Unicode). In fact, uc(lc("ẞ")) returns SS. – ikegami

5 Answers

4
votes

It works perfectly fine:

$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match

$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match

(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)

I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
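To make the suspected encoding problem concrete, here is a minimal sketch that runs the same case-insensitive pattern against undecoded UTF-8 bytes and against the decoded character string. It assumes Perl 5.14 or later (for /u and multi-character folds); the variable names are illustrative, and the text is written with escapes so the result does not depend on how the script file itself is encoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for "ÜBERMÄSSIG", as they would arrive undecoded
# from a file or terminal (hypothetical input for illustration).
my $bytes = "\xC3\x9CBERM\xC3\x84SSIG";

# The same text after decoding into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# "übermäßig" by code point: ü = U+00FC, ä = U+00E4, ß = U+00DF.
# /u requests Unicode rules regardless of how the string is stored.
my $pattern = qr/^\x{FC}berm\x{E4}\x{DF}ig\z/iu;

printf "bytes: length %d, %s\n", length($bytes),
    $bytes =~ $pattern ? "match" : "no match";
printf "chars: length %d, %s\n", length($chars),
    $chars =~ $pattern ? "match" : "no match";
```

The undecoded string is longer because each of Ü and Ä occupies two bytes, and the pattern sees those bytes as separate characters, so only the decoded string should match.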

2
votes

The /i modifier works nicely if the strings are decoded character strings, that is, if they use Perl's internal encoding rather than raw UTF-8 bytes.

For example, this prints "yes":

perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'

The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.

If your strings come from somewhere else, then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (this is possible with most DBI drivers, for example).
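As a rough sketch of that advice, the following decodes externally supplied bytes with Encode::decode and then lowercases them, which also covers the second half of the question. It assumes Perl 5.16 or later (for the fc feature); the sample bytes and variable names are hypothetical, not taken from the original post.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc requires Perl 5.16+
use Encode qw(decode encode);

# Hypothetical external input: UTF-8 bytes for "ÜBERMÄSSIG".
my $raw  = "\xC3\x9CBERM\xC3\x84SSIG";
my $text = decode('UTF-8', $raw);     # now a character string

# lc() lowercases character by character: Ü -> ü, Ä -> ä, S -> s.
my $lower = lc $text;

# fc() applies full Unicode case folding, under which ß folds to "ss",
# so it is the safer choice when comparing strings caselessly.
my $same = fc($text) eq fc("\x{FC}berm\x{E4}\x{DF}ig");   # "übermäßig"

# Encode again on output so the terminal receives UTF-8 bytes.
print encode('UTF-8', $lower), "\n";
print $same ? "equal under case folding\n" : "different\n";
```

Note that lc alone never turns "ss" back into "ß", so for caseless equality tests fc on both sides is usually the more robust comparison.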

1
votes

It works for me. Maybe you need to add use utf8; to your script?

(Disclaimer: I don't know Perl.)

1
votes

If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.

Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().

Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.

0
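votes

use locale;
use POSIX qw(locale_h);
# %locale was never defined in the original snippet; this mapping is an assumed example.
# Locale names vary by system (check `locale -a`).
my %locale = (German => 'de_DE.UTF-8');
setlocale(LC_ALL, $locale{German}) or die "failed to load locale!";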