Matching accented characters with Javascript regexes

55

votes

Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'| ....)

javascript regex unicode internationalization

The answer to your WTF is that Javascript doesn’t handle Unicode correctly in regular expressions. See the standard to see how it is supposed to work. Or use a language that’s standards-compliant in this regard. Just to name a few... in Perl, PHP, PCRE, and ICU regexes, "à" certainly matches the pattern /\bà/. They’re much better for Unicode work. – tchrist

you may want to remove accents & then do a simple [a-z] check. see stackoverflow.com/questions/990904/… – Adrien Be

66

votes

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i

With help from here
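As a rough check of what that range covers, here is a small sketch. The sample words and the variable names are illustrative assumptions, not part of the original answer; the range U+00E0 to U+00FC is the Latin-1 lowercase accented block, so letters from other blocks fall outside it.

```javascript
// The pattern from the answer above, unchanged.
const latin1Word = /^[a-z\u00E0-\u00FC]+$/i;

// Illustrative sample words (assumed, not from the original answer).
const samples = ['déjà', 'Über', 'cœur', 'Łódź'];

// Normalize to composed form (NFC) first, so that each accented letter
// is a single code point the character class can see.
const coverage = samples.map((word) => [word, latin1Word.test(word.normalize('NFC'))]);

console.log(coverage);
// 'déjà' and 'Über' pass (the i flag also covers the uppercase forms);
// 'cœur' (U+0153) and 'Łódź' (U+0141, U+017A) fail because those letters
// live outside U+00E0-U+00FC.
```

So the pattern is fine for most Western European text, but it is not a general "any accented letter" test.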

40

votes

/\bà/.test("à") fails because "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which does not count as a word character) and the letter "a".

Word characters in JavaScript regexes are defined as [a-zA-Z0-9_].

To match an accented character at the start of a string, just use the ^ anchor at the beginning of the regex (e.g. /^à/). It matches the beginning of the string (unlike \b, which matches at any word boundary within the string). It's one of the most basic and standard regular expression constructs, so it's definitely not over the top.
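The boundary rule is easy to verify directly. The snippet below is a small sketch; the object and property names are made up for illustration, and "à" is written as \u00E0 so the composed code point is unambiguous.

```javascript
// \w, and therefore \b, only recognizes [A-Za-z0-9_] in JavaScript.
const boundaryChecks = {
  // "a" is a word character, "à" (U+00E0) is not.
  asciiIsWordChar: /\w/.test('a'),
  accentedIsWordChar: /\w/.test('\u00E0'),

  // Start of string and "à" are both non-word, so there is no boundary.
  boundaryBeforeLoneAccent: /\b\u00E0/.test('\u00E0'),

  // Between "x" (word) and "à" (non-word) there IS a boundary, so \b
  // matches in the middle of what a human would call one word.
  boundaryInsideWord: /\b\u00E0/.test('x\u00E0'),

  // ^ does not care about word characters at all.
  anchorBeforeAccent: /^\u00E0/.test('\u00E0'),
};

console.log(boundaryChecks);
// asciiIsWordChar: true, accentedIsWordChar: false,
// boundaryBeforeLoneAccent: false, boundaryInsideWord: true,
// anchorBeforeAccent: true
```

The boundaryInsideWord case shows that \b is not just missing matches on accented text, it can also report a "word start" in the middle of a word.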

2

votes

Stack Overflow also had an issue with non-ASCII characters in regexes; you can find it here. It does not deal with word boundaries, but it may give you useful hints anyway.

There is another page, but that asker wants to match strings, not words.

I don't know of an anchor that solves your problem, and I could not find one just now. But given the monster regexes used in my first link, the group you want to avoid is not over the top and is, in my opinion, your solution.

2

votes

const regex = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const test1 = regex.test("à");
const test2 = regex.test("Martinez-Cortez");
const test3 = regex.test("Leonardo da vinci");
const test4 = regex.test("ï");

console.log('test1', test1);
console.log('test2', test2);
console.log('test3', test3);
console.log('test4', test4);

Building off of Wak's and Cœur's answer:

/^[\-/A-Za-z\u00C0-\u017F ]+$/

Works for spaces and dashes too.

Example: Leonardo da vinci, Martinez-Cortez
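One thing to be aware of: the range U+00C0 to U+017F also contains the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7). If that matters, a possible variant splits the range around those two code points. This is only a sketch; the names `original` and `lettersOnly` are assumptions made for the comparison.

```javascript
// The pattern from this answer, unchanged.
const original = /^[\-/A-Za-z\u00C0-\u017F ]+$/;

// Variant that skips U+00D7 (×) and U+00F7 (÷), the two non-letters
// inside the U+00C0-U+017F range.
const lettersOnly = /^[\-/A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F ]+$/;

console.log(original.test('a \u00D7 b'));           // true: × slips through
console.log(lettersOnly.test('a \u00D7 b'));        // false
console.log(lettersOnly.test('Martinez-Cortez'));   // true
console.log(lettersOnly.test('Leonardo da vinci')); // true
console.log(lettersOnly.test('\u00EF'));            // true: ï
```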

1

votes

If you want to match letters, whether or not they're accented, Unicode property escapes can be helpful.

/\p{Letter}/u.test("à"); // true
/\p{Letter}/u.test('œ'); // true
/\p{Letter}/u.test('a'); // true
/\p{Letter}/u.test('3'); // false

Matching to the start of a word is tricky, but (?<=(?:^|\s)) seems to do the trick. The (?<= ) is a positive lookbehind, ensuring that something exists before the main expression. The (?: ) is a non-capture group, so you don't end up with a reference to this part in whatever match you use later. Then the ^ will match the start of the string if the multiline flag isn't set or the start of the line if the multiline flag is set and the \s will match a whitespace character (space/tab/linebreak).

So using them together, it would look something like:

/(?<=(?:^|\s))\p{Letter}*/u

If you only want to match words that start with an accented character, put a negated character set for a-zA-Z in front (note that [^a-zA-Z] also accepts digits and punctuation, not just accented letters).

/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("bœ") // false
/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("œb") // true

// Match characters, accented or not
let regex = /\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log(regex.test("16 tons")); // true
console.log(regex.test("3 œ")); // true

console.log('-----');

// Match characters to start of line, only match characters

regex = /(?<=(?:^|\s))\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false

console.log('----');

// Match accented character to start of word, only match characters

regex = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}+$/u;

console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // false
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false

0

votes

Unicode allows for two alternative but equivalent representations of some accented characters. For example, é has two Unicode representations: '\u00E9' and '\u0065\u0301'. The former is called composed form and the latter is called decomposed form. JavaScript allows for conversion between the two:

'\u00E9'.normalize('NFD') // decompose: '\u00E9' -> '\u0065\u0301'
'\u0065\u0301'.normalize('NFC') // compose: '\u0065\u0301' -> '\u00E9'
'\u00E9'.length // composed form: -> 1
'\u0065\u0301'.length // decomposed form: -> 2 (looks identical but has different representation)
'\u00E9' == '\u0065\u0301' // -> false (composed and decomposed strings are not equal)

The code point '\u0301' belongs to the Unicode Combining Diacritical Marks code block 0300-036F. So one way to match these accented characters is to compare them in decomposed form:

// matching accented characters
/[a-zA-Z][\u0300-\u036f]+/.test('é'.normalize('NFD')) // -> true
/\b\u00E9/.test('\u00E9') // -> false (composed form: é is not a word character)
/\be\u0301/.test('é'.normalize('NFD')) // -> true (NOTE: the pattern uses the decomposed form)

// matching accented words
/^\w+$/.test('résumé') // -> false
/^(?:[a-zA-Z][\u0300-\u036f]*)+$/.test('résumé'.normalize('NFD')) // -> true
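The same decomposition can be used to drop the accents entirely and then fall back on plain ASCII patterns, as one of the comments on the question suggests. A minimal sketch follows; `stripDiacritics` is a hypothetical helper name, and it only handles characters that have a canonical decomposition.

```javascript
// Hypothetical helper: decompose, then delete the combining marks
// (U+0300-U+036F), leaving only the base letters.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripDiacritics('résumé'));            // 'resume'
console.log(/\ba/.test(stripDiacritics('à')));     // true: \b works on the base letter
console.log(/^\w+$/.test(stripDiacritics('résumé'))); // true

// Caveat: letters such as œ (U+0153) or ø (U+00F8) have no decomposition,
// so they pass through unchanged and still are not word characters.
console.log(stripDiacritics('cœur'));              // 'cœur'
```

This trades accuracy for simplicity: the match runs against the stripped copy, so any indices or captured text refer to that copy, not to the original string.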
