
I have an Android application that reads a file containing an SQL script and inserts its data into a SQLite database. I need to know the exact encoding of this file: an EditText displays information read from SQLite, and if the encoding is wrong, characters like "ç, í, ã" show up as invalid characters such as "?".

I have the following code:

FileInputStream fIn = new FileInputStream(myFile);
BufferedReader myReader = new BufferedReader(new InputStreamReader(fIn, "ISO-8859-1"));
String aDataRow;
while ((aDataRow = myReader.readLine()) != null) {
    if (!aDataRow.isEmpty()) {
        // Each line may hold several statements separated by ';'
        String[] querys = aDataRow.split(";");
        Collections.addAll(querysParaExecutar, querys);
    }
}
myReader.close();

This works for ISO-8859-1, and it also works for UTF-8 if I pass "UTF-8" as the charset. I need to programmatically detect the file's encoding (UTF-8 or ISO-8859-1) and apply the correct one in my code. Is there a simple way to do that?

There is no foolproof way to determine character encoding from the encoded character data alone. There are heuristic approaches that even a cursory web search should have turned up, but the usual mechanism is to rely on the encoding to be specified separately from the content. - John Bollinger
Usually the encoding is specified by whoever inserted the data, or separately from the content. - sdfbhg
That's true, but the problem is that my users are going to edit the file. If they edit it with Windows Notepad and save it, it always ends up encoded as ISO-8859-1, while the original file encoding is UTF-8. - Kevin Giediel
I suggest you look for an encoding-guessing library. Validating UTF-8 is pretty straightforward, but when it comes to Notepad and friends, you will get whatever 8-bit encoding suits the user's localisation (e.g. the Polish ł can't be represented in ISO-8859-1). And to distinguish different ISO-8859-X encodings, you need statistics of character distributions in different languages. - lenz
@hooknc: US-ASCII is a subset of UTF-8. ISO-8859-1 is a subset in terms of character repertoire, but not in terms of encoded bytes. - dan04
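As the comments note, validating UTF-8 is straightforward, and since the only two candidates here are UTF-8 and ISO-8859-1, a standard-library check can work without an external detector. A minimal sketch (the class and method names below are made up for illustration): if the bytes decode as strict UTF-8, treat the file as UTF-8; otherwise fall back to ISO-8859-1, which maps every byte to a character. Note the limits of this heuristic: a pure-ASCII file is valid in both encodings, and in rare cases an ISO-8859-1 file can happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    // Returns "UTF-8" if the bytes form valid UTF-8, otherwise "ISO-8859-1".
    static String guessCharset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // Invalid UTF-8 byte sequence: assume the 8-bit fallback
            return "ISO-8859-1";
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "ação".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "ação".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessCharset(utf8));   // prints UTF-8
        System.out.println(guessCharset(latin1)); // prints ISO-8859-1
    }
}
```

The reason this works for the question's case: "ç" is the single byte 0xE7 in ISO-8859-1, and in UTF-8 the byte 0xE7 must start a three-byte sequence with two continuation bytes, which an ISO-8859-1 text almost never supplies.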

1 Answer


I solved the problem with the juniversalchardet library. It works as expected.

FileInputStream fIn = new FileInputStream(myFile);
byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = fIn.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();
String encoding = detector.getDetectedCharset();
detector.reset();
fIn.close(); // the stream was consumed during detection, so reopen the file below

// Notepad files are typically detected as WINDOWS-1252; read them as ISO-8859-1.
// Also fall back to ISO-8859-1 when detection fails (getDetectedCharset() returns null).
String charsetName = "ISO-8859-1";
if ("UTF-8".equalsIgnoreCase(encoding)) {
    charsetName = "UTF-8";
}

BufferedReader myReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(myFile), charsetName));