190
votes

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters in their names, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many files are affected. I was going to do this with find, then a grep to print the names containing non-ASCII characters, and then wc -l to get the count. It doesn't have to be grep; I can use any standard Unix tool that supports regular expressions, like Perl, sed, AWK, etc.

However, is there a regular expression for 'any character that's not an ASCII character'?

9
Paul, yes I can use Perl. - Rory
/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F] - Tinmarino

9 Answers

336
votes

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression).

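You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension rather than one of the classes POSIX itself defines, so tools without Perl-style regex support may not accept it):

  • [[:ascii:]] - matches a single ASCII char
  • [^[:ascii:]] - matches a single non-ASCII char

[^[:print:]] will probably suffice for you, although it also matches ASCII control characters, and whether it matches non-ASCII characters depends on the tool and locale.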
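
For the counting task in the question, here is a minimal Python sketch of the same idea. It assumes Python 3 is available and that only the names directly inside one directory matter; the function name count_non_ascii_names is made up for illustration.

```python
import os
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def count_non_ascii_names(names):
    """Count the names that contain at least one non-ASCII character."""
    return sum(1 for name in names if NON_ASCII.search(name))


if __name__ == "__main__":
    # Count entries in the current directory (not recursive).
    print(count_non_ascii_names(os.listdir(".")))
```

For a recursive count, the same function could be fed the file names yielded by os.walk instead of os.listdir.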

40
votes

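No, [^\x20-\x7E] does not mean "non-ASCII": it matches everything outside the printable ASCII range, including ASCII control characters.

This matches everything outside the real ASCII range:

 [^\x00-\x7F]

Otherwise, it will also trim out newlines and other special characters that are part of the ASCII table!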
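
To see the difference concretely, here is a small Python sketch (the sample string is made up) that strips matches of each pattern from a pure-ASCII string containing a newline and a tab:

```python
import re

printable_only = re.compile(r"[^\x20-\x7E]")  # anything outside printable ASCII
full_ascii = re.compile(r"[^\x00-\x7F]")      # anything outside all of ASCII

# Pure ASCII, but it contains a newline and a tab.
sample = "line one\n\tline two"

stripped_printable = printable_only.sub("", sample)
stripped_full = full_ascii.sub("", sample)

print(repr(stripped_printable))  # newline and tab are gone
print(repr(stripped_full))       # string is unchanged
```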

6
votes

You could also check this page: Unicode Regular Expressions. It contains some useful Unicode character classes, like:

\p{Control}: an ASCII 0x00..0x1F or 0x7F, or Latin-1 0x80..0x9F, control character.

3
votes

[^\x00-\x7F] and [^[:ascii:]] do not match ASCII control bytes, so those are left in the output; strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
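
If strings is not at hand, something roughly similar can be sketched in Python. This is only an approximation of what strings(1) does, assuming "printable" means tab plus bytes 0x20 to 0x7E and a minimum run length of 4; the function name ascii_strings is made up.

```python
import re


def ascii_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes (tab included)."""
    pattern = re.compile(rb"[\t\x20-\x7E]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


if __name__ == "__main__":
    blob = b"\x00\x01announce\x1b\xff\xfeinfo\x07ab"
    for text in ascii_strings(blob):
        print(text)
```

Because control bytes such as 0x1B (escape) break a run instead of being passed through, the output is safe to print to a terminal.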

3
votes

To validate that a text box accepts ASCII only, use this pattern:

[\x00-\x7F]+
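
One caveat: used unanchored, this pattern succeeds on any input that contains at least one ASCII character, so for validation the whole input has to match. A minimal Python sketch, with a made-up helper name is_ascii_only:

```python
import re

ASCII_ONLY = re.compile(r"[\x00-\x7F]+")


def is_ascii_only(text: str) -> bool:
    """True if text is non-empty and every character is in the ASCII range."""
    # fullmatch anchors the pattern at both ends of the input.
    return ASCII_ONLY.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_ascii_only("hello"))
    print(is_ascii_only("h\u00e9llo"))
```

Note that the + quantifier makes the empty string fail; whether that is wanted depends on the form.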

3
votes

I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
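
A short Python sketch of what this pattern catches, using a made-up sample string: it flags non-ASCII characters and most control characters, but leaves tabs, carriage returns and newlines alone.

```python
import re

# Everything except tab, CR, LF and printable ASCII.
UNWANTED = re.compile(r"[^\t\r\n\x20-\x7E]+")

# Contains a tab, an e-acute, a BEL control character and a CRLF line ending.
sample = "name:\tcaf\u00e9\x07\r\n"

runs = UNWANTED.findall(sample)
cleaned = UNWANTED.sub("", sample)

print(runs)
print(repr(cleaned))
```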

2
votes

You can use this regex:

[^\w \xC0-\xFF]

In case you ask, the option used is Multiline.

2
votes

You don't really need a regex.

printf "%s\n" *[!\ -~]*

This will show file names with control characters in their names, too, but I consider that a feature.

If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
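
The same glob can be tried from Python as well. This is a sketch using the standard fnmatch module, which understands the [!...] negation; the list of names is made up for illustration, and no files are touched.

```python
import fnmatch

# Same idea as the shell glob: any name containing a character
# outside the printable ASCII range (space through tilde).
PATTERN = "*[! -~]*"

names = ["plain.txt", "caf\u00e9.txt", "tab\there.log"]

matches = [name for name in names if fnmatch.fnmatchcase(name, PATTERN)]

for name in matches:
    print(repr(name))
```

As with the shell version, names containing control characters are reported too.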

1
votes

This turned out to be very flexible and extensible:

$field =~ s/[^\x00-\x7F]//g;

This way, all non-ASCII characters (or other specific items in question) can be cleaned out. It is very nice either in selection or in pre-processing of items that will eventually become hash keys.
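
For comparison, a rough Python equivalent of that pre-processing step. It is a sketch only; the helper name strip_non_ascii and the sample data are made up.

```python
import re


def strip_non_ascii(field: str) -> str:
    """Remove every character outside the ASCII range."""
    return re.sub(r"[^\x00-\x7F]", "", field)


raw = {"na\u00efve": 1, "plain": 2}

# Clean the keys before using them in a dictionary.
cleaned = {strip_non_ascii(key): value for key, value in raw.items()}

print(cleaned)
```

One thing to watch for: two different keys can collapse into the same cleaned key, in which case the later value silently wins.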