How can I find encoding of a file via a script on Linux?

352

votes

I need to find the encoding of all files that are placed in a directory. Is there a way to find the encoding used?

The file command is not able to do this.

The encoding that is of interest to me is ISO 8859-1. If the encoding is anything else, I want to move the file to another directory.

Tags: file, shell, unix, encoding

If you have an idea of what kind of scripting language you might want to use, tag your question with the name of that language. That might help... - MatrixFrog

Or maybe he's just trying to build a shell script? - Shalom Craimer

Which would be an answer to “which scripting language”. - bignose

Maybe not related to this answer, but a general tip: when you can describe your whole problem in one word ("encoding", here), just run apropos encoding. It searches the titles and descriptions of all the man pages. When I do this on my machine, I see three tools that might help me, judging by their descriptions: chardet, chardet3, chardetect3. Running man chardet and reading the man page then tells me that chardet is just the utility I need. - John Red

The encoding might change when you change the content of a file. E.g., in vi, when you write a simple C program, it's probably us-ascii, but after you add a line of Chinese comment, it becomes utf-8. file can tell the encoding by reading the file content and guessing. - user218867

489

votes

It sounds like you're looking for enca. It can guess and even convert between encodings. Just look at the man page.

Or, failing that, use file -i (Linux) or file -I (OS X). That will output MIME-type information for the file, which will also include the character-set encoding. I found a man-page for it, too :)
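As a rough sketch of how the file -i approach could be scripted for the original question (keep ISO 8859-1 files, move everything else), the Python below parses the charset out of file's MIME output. The names parse_charset, detect_charset and move_unless_charset are made up for this illustration, and it assumes the Linux -i flag (use -I on OS X). Note that file reports pure 7-bit files as us-ascii, so you may want to accept that value too.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def parse_charset(mime_line: str) -> Optional[str]:
    """Return the charset from a line like 'text/plain; charset=utf-8', or None."""
    for part in mime_line.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep and key.strip().lower() == "charset":
            return value.strip().lower()
    return None


def detect_charset(path: Path) -> Optional[str]:
    """Ask file(1) for MIME info (-b: omit the file name, -i: MIME output)."""
    result = subprocess.run(
        ["file", "-bi", str(path)], capture_output=True, text=True, check=True
    )
    return parse_charset(result.stdout)


def move_unless_charset(src_dir: Path, dest_dir: Path, wanted=("iso-8859-1",)) -> None:
    """Move every regular file in src_dir whose detected charset is not in wanted."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(src_dir.iterdir()):
        if path.is_file() and detect_charset(path) not in wanted:
            shutil.move(str(path), str(dest_dir / path.name))
```

For example, move_unless_charset(Path("incoming"), Path("not-latin1"), wanted=("iso-8859-1", "us-ascii")) would leave ASCII and ISO 8859-1 files in place; both directory names are placeholders.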

96

votes

file -bi <file name>

If you'd like to do this for a bunch of files:

find . -type f | grep -Ev 'Eliminate' | while IFS= read -r f; do echo "$f -- $(file -bi "$f")"; done

44

votes

uchardet - An encoding detector library ported from Mozilla.

Usage:

~> uchardet file.java
UTF-8

Various Linux distributions (Debian, Ubuntu, openSUSE, Arch Linux via pacman, etc.) provide binaries.

11

votes

Here is an example script using file -I and iconv which works on Mac OS X.

For your question, you need to use mv instead of iconv:

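#!/bin/bash
# 2016-02-08
# check encoding and convert files
for f in *.java
do
  # file -bI prints e.g. "text/plain; charset=iso-8859-1" (use -bi on Linux)
  encoding=$(file -bI "$f" | cut -f 2 -d";" | cut -f 2 -d=)
  case $encoding in
    iso-8859-1)
      # convert into a temporary file, then replace the original on success
      iconv -f iso-8859-1 -t utf-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
      ;;
  esac
done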

9

votes

In Debian you can also use encguess:

$ encguess test.txt
test.txt  US-ASCII

9

votes

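To convert from ISO 8859-1 to ASCII (the result is written to standard output, and iconv stops with an error at the first character that has no ASCII equivalent):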

iconv -f ISO_8859-1 -t ASCII filename.txt

6

votes

It is really hard to determine whether a file is ISO 8859-1. If you have a text with only 7-bit characters, it could be ISO 8859-1, but you can't know. If you have 8-bit characters, the characters in the upper region exist in other encodings as well. Therefore you would have to use a dictionary to make a better guess at which word is meant, and determine from there which letter it must be. Finally, if you detect that it might be UTF-8, then you can be fairly sure it is not ISO 8859-1.

Encoding detection is one of the hardest things to do, because you can never know for sure unless something tells you.
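The reasoning above can be sketched in Python as follows. This is only a heuristic under the assumptions stated in this answer: pure 7-bit data is reported as ASCII, data that decodes cleanly as UTF-8 is taken to be UTF-8, and anything else is merely possibly ISO 8859-1 (it could be any other 8-bit encoding). The function name classify_bytes and the returned labels are invented for this example.

```python
def classify_bytes(data: bytes) -> str:
    """Coarsely classify raw bytes; the result is a guess, not a proof."""
    # Only 7-bit values: valid ASCII, and therefore also valid ISO 8859-1.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # High bytes that form valid UTF-8 sequences are very unlikely to be
    # intentional ISO 8859-1 text.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit encoding; ISO 8859-1 is only one candidate.
        return "8-bit, possibly iso-8859-1"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            print(name, classify_bytes(handle.read()))
```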

5

votes

With Python, you can use the chardet module.

3

votes

This is not something you can do in a foolproof way. One possibility would be to examine every character in the file to ensure that it doesn't contain any characters in the ranges 0x00-0x1f or 0x7f-0x9f but, as I said, this may be true for any number of files, including at least one other variant of ISO 8859.
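A minimal Python sketch of that byte-range check might look like this. One assumption beyond the text above: tab, line feed and carriage return fall inside 0x00-0x1f but appear in almost every text file, so this sketch tolerates them. The names ALLOWED_WHITESPACE and has_disallowed_controls are hypothetical.

```python
# Assumption: ordinary whitespace control characters are acceptable in text.
ALLOWED_WHITESPACE = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return


def has_disallowed_controls(data: bytes) -> bool:
    """Return True if data contains bytes in 0x00-0x1f or 0x7f-0x9f
    (other than the whitespace allowed above)."""
    for byte in data:
        if byte in ALLOWED_WHITESPACE:
            continue
        if byte <= 0x1F or 0x7F <= byte <= 0x9F:
            return True
    return False
```

A file for which this returns False is merely consistent with ISO 8859-1; as noted, the same holds for other ISO 8859 variants.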

Another possibility is to look for specific words in the file in all of the languages supported and see if you can find them.

So, for example, find the equivalent of the English "and", "but", "to", "of" and so on in all the supported languages of ISO 8859-1 and see if they have a large number of occurrences within the file.

I'm not talking about literal translation such as:

English   French
-------   ------
of        de, du
and       et
the       le, la, les

although that's possible. I'm talking about common words in the target language (for all I know, Icelandic has no word for "and" - you'd probably have to use their word for "fish" [sorry that's a little stereotypical. I didn't mean any offense, just illustrating a point]).
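To illustrate the common-word idea, here is a small Python sketch that decodes the bytes as ISO 8859-1 and measures what fraction of the words are in a list of frequent words. The word lists below are tiny placeholders taken from the examples in this answer, not a real dictionary, and the name common_word_ratio is made up; a usable version would need proper per-language lists and a threshold chosen by experiment.

```python
import re

# Placeholder lists only; real use would need far larger lists per language.
COMMON_WORDS = {
    "english": {"and", "but", "to", "of", "the"},
    "french": {"de", "du", "et", "le", "la", "les"},
}


def common_word_ratio(data: bytes) -> float:
    """Decode data as ISO 8859-1 and return the share of words that are
    in COMMON_WORDS (0.0 if there are no words at all)."""
    text = data.decode("latin-1").lower()  # latin-1 never fails to decode
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, incl. accented
    if not words:
        return 0.0
    known = set().union(*COMMON_WORDS.values())
    hits = sum(1 for word in words if word in known)
    return hits / len(words)
```

A high ratio suggests the bytes decode to plausible text in one of the listed languages; a ratio near zero suggests the file is something else.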

2

votes

I know you're interested in a more general answer, but what's good in ASCII is usually good in other encodings. Here is a Python one-liner to determine if standard input is ASCII. (I've only tested it on Python 3; it relies on sys.stdin.buffer, which Python 2 does not have.)

python3 -c 'import sys; data = sys.stdin.buffer.read(); sys.exit(0 if all(b < 128 for b in data) else "Not ASCII")' < myfile.txt

2

votes

If you're talking about XML files (ISO-8859-1), the XML declaration inside them specifies the encoding: <?xml version="1.0" encoding="ISO-8859-1" ?> So, you can use regular expressions (e.g., with Perl) to check every file for such a declaration.

More information can be found here: How to Determine Text File Encoding.
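The regular-expression check described above is sketched here in Python rather than Perl. It assumes the declaration sits within the first 256 bytes and that the file uses an ASCII-compatible encoding (a UTF-16 file would not match); the names XML_DECL_ENCODING and declared_xml_encoding are invented for this example. Keep in mind that the declaration only states what the author claimed, not what the bytes actually are.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute inside an XML declaration.
XML_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_xml_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = XML_DECL_ENCODING.search(data[:256])
    if match is None:
        return None
    return match.group(1).decode("ascii")
```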

2

votes

In PHP you can check it like below:

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate, using mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example, I passed an explicit list of candidate encodings, which are tried in the order given. For a more accurate result, you can pass all supported encodings via mb_list_encodings().

Note the mb_* functions require php-mbstring:

apt-get install php-mbstring

1

votes

I am using the following script to:

1. Find all files that match FILTER with SRC_ENCODING
2. Create a backup of them
3. Convert them to DST_ENCODING
4. (optional) Remove the backups

#!/bin/bash -xe

SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"

echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
find . -type f -iname "$FILTER" -print0 | while IFS= read -r -d '' FILE ; do
    # Skip files whose detected charset differs from SRC_ENCODING
    [ "$(file -b --mime-encoding "$FILE")" = "$SRC_ENCODING" ] || continue
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"

    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done

echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;

1

votes

With this command:

for f in `find .`; do echo `file -i "$f"`; done

you can list all files in a directory and subdirectories and the corresponding encoding.

If files have a space in the name, use:

IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done

Remember that this changes IFS for the rest of your current Bash session, so word splitting on spaces stays disabled until you reset it.

0

votes

You can extract the encoding of a single file with the file command. I have a sample.html file with:

$ file sample.html

sample.html: HTML document, UTF-8 Unicode text, with very long lines

$ file -b sample.html

HTML document, UTF-8 Unicode text, with very long lines

$ file -bi sample.html

text/html; charset=utf-8

$ file -bi sample.html  | awk -F'=' '{print $2 }'

utf-8

0

votes

In Cygwin, this looks like it works for me:

find -type f -name "<FILENAME_GLOB>" | while IFS= read -r <VAR>; do file -i "$<VAR>"; done

Example:

find -type f -name "*.txt" | while IFS= read -r file; do file -i "$file"; done

You could pipe that to AWK and create an iconv command to convert everything to UTF-8, from any source encoding supported by iconv.

Example:

find -type f -name "*.txt" | while IFS= read -r file; do file -i "$file"; done | awk -F'[:=]' '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash

-2

votes

With Perl, use Encode::Detect.
