I have a problem identifying the encoding of a file that has no BOM, particularly when the file begins with non-ASCII characters.
I found a couple of existing topics on how to identify encodings for files, and based on them I created a class to detect different encodings (e.g. UTF-8, UTF-16, UTF-32, UTF-16 without BOM, etc.), like the following:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     *                        or <code>null</code> to use the system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        // The UTF-32 BOMs are checked before the UTF-16 ones, because the
        // UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
        if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM detected, but the file could still be UTF-16 without a BOM.
            // ASCII characters encoded as UTF-16 contain 0x00 bytes, so count the
            // zero bytes among the bytes actually read and guess the byte order.
            int found = 0;
            for (int i = 0; i < n; i++) {
                if (bom[i] == (byte) 0x00)
                    found++;
            }
            if (found >= 2) {
                if (bom[0] == (byte) 0x00) {
                    encoding = "UTF-16BE";
                } else {
                    encoding = "UTF-16LE";
                }
            } else {
                encoding = defaultEncoding;
            }
            unread = n;
        }

        // Unread everything after the BOM (or everything, if there was no BOM).
        if (unread > 0) {
            pushbackStream.unread(bom, n - unread, unread);
        }

        // Use the detected encoding, or fall back to the system default.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}
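For reference, here is a minimal sketch of how I call the class (the file name "sample.txt" and the UTF-8 fallback are just placeholders):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;

public class UnicodeReaderDemo {
    public static void main(String[] args) throws IOException {
        // Wrap the file stream; fall back to UTF-8 when no BOM is found.
        UnicodeReader reader = new UnicodeReader(new FileInputStream("sample.txt"), "UTF-8");
        System.out.println("Detected encoding: " + reader.getEncoding());

        BufferedReader buffered = new BufferedReader(reader);
        String line;
        while ((line = buffered.readLine()) != null) {
            System.out.println(line);
        }
        buffered.close();
    }
}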
The code above works properly in all cases except when the file has no BOM and begins with non-ASCII characters. In that case the zero-byte heuristic for detecting BOM-less UTF-16 never fires, and the encoding falls back to the default (UTF-8 in my case).
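For example, the following sketch reproduces the failure (the Japanese sample text is arbitrary; any BOM-less UTF-16 file whose first characters are outside the ASCII range behaves the same way):

import java.io.ByteArrayInputStream;

public class NoBomFailureDemo {
    public static void main(String[] args) throws Exception {
        // "日本語" in UTF-16LE without a BOM encodes to E5 65 2C 67 9E 8A:
        // there is no 0x00 byte in the first four bytes, so the zero-byte
        // heuristic above never fires and the default encoding is used.
        byte[] data = "日本語".getBytes("UTF-16LE");
        UnicodeReader reader = new UnicodeReader(new ByteArrayInputStream(data), "UTF-8");
        System.out.println(reader.getEncoding()); // reports the UTF-8 fallback, not UTF-16LE
        reader.close();
    }
}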
Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a UTF-16 file without a BOM?
Thanks, any idea would be appreciated.