How can I write a regex that matches only letters?
20 Answers
Use a character set: [a-zA-Z]
matches one letter from A–Z in lowercase and uppercase. [a-zA-Z]+
matches one or more letters and ^[a-zA-Z]+$
matches only strings that consist of one or more letters only (^
and $
mark the begin and end of a string respectively).
If you want to match other letters than A–Z, you can either add them to the character set: [a-zA-ZäöüßÄÖÜ]
. Or you use predefined character classes like the Unicode character property class \p{L}
that describes the Unicode characters that are letters.
Regular expression which few people has written as "/^[a-zA-Z]$/i" is not correct because at the last they have mentioned /i which is for case insensitive and after matching for first time it will return back. Instead of /i just use /g which is for global and you also do not have any need to put ^ $ for starting and ending.
/[a-zA-Z]+/g
- [a-z_]+ match a single character present in the list below
- Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed
- a-z a single character in the range between a and z (case sensitive)
- A-Z a single character in the range between A and Z (case sensitive)
- g modifier: global. All matches (don't return on first match)
In python, I have found the following to work:
[^\W\d_]
This works because we are creating a new character class (the []
) which excludes (^
) any character from the class \W
(everything NOT in [a-zA-Z0-9_]
), also excludes any digit (\d
) and also excludes the underscore (_
).
That is, we have taken the character class [a-zA-Z0-9_]
and removed the 0-9
and _
bits. You might ask, wouldn't it just be easier to write [a-zA-Z]
then, instead of [^\W\d_]
? You would be correct if dealing only with ASCII text, but when dealing with unicode text:
\W
Matches any character which is not a word character. This is the opposite of \w. > If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_].
^ from the python re module documentation
That is, we are taking everything considered to be a word character in unicode, removing everything considered to be a digit character in unicode, and also removing the underscore.
For example, the following code snippet
import re
regex = "[^\W\d_]"
test_string = "A;,./>>?()*)&^*&^%&^#Bsfa1 203974"
re.findall(regex, test_string)
Returns
['A', 'B', 's', 'f', 'a']
If you mean any letters in any character encoding, then a good approach might be to delete non-letters like spaces \s
, digits \d
, and other special characters like:
[!@#\$%\^&\*\(\)\[\]:;'",\. ...more special chars... ]
Or use negation of above negation to directly describe any letters:
\S \D and [^ ..special chars..]
Pros:
- Works with all regex flavors.
- Easy to write, sometimes save lots of time.
Cons:
- Long, sometimes not perfect, but character encoding can be broken as well.
So, I've been reading a lot of the answers, and most of them don't take exceptions into account, like letters with accents or diaeresis (á, à, ä, etc.).
I made a function in typescript that should be pretty much extrapolable to any language that can use RegExp. This is my personal implementation for my use case in TypeScript. What I basically did is add ranges of letters with each kind of symbol that I wanted to add. I also converted the char to upper case before applying the RegExp, which saves me some work.
function isLetter(char: string): boolean {
return char.toUpperCase().match('[A-ZÀ-ÚÄ-Ü]+') !== null;
}
If you want to add another range of letters with another kind of accent, just add it to the regex. Same goes for special symbols.
I implemented this function with TDD and I can confirm this works with, at least, the following cases:
character | isLetter
${'A'} | ${true}
${'e'} | ${true}
${'Á'} | ${true}
${'ü'} | ${true}
${'ù'} | ${true}
${'û'} | ${true}
${'('} | ${false}
${'^'} | ${false}
${"'"} | ${false}
${'`'} | ${false}
${' '} | ${false}
characters
? ASCII? Kanji? Iso-XXXX-X? UTF8? – Ivo Wetzelregex
? Perl? Emacs? Grep? – Pascal Cuoq/\p{L}+/u
– MaxZoom