
I have an Android application that reads a file containing an SQL script to insert data into a SQLite DB. However, I need to know the exact encoding of this file. I have an EditText that displays information read from SQLite, and if the encoding is wrong, invalid characters such as "?" are shown instead of characters like "ç, í, ã".

I have the following code:

// Read the script with a fixed charset and split it into individual statements
FileInputStream fIn = new FileInputStream(myFile);
BufferedReader myReader = new BufferedReader(new InputStreamReader(fIn, "ISO-8859-1"));
String aDataRow;
while ((aDataRow = myReader.readLine()) != null) {
    if (!aDataRow.isEmpty()) {
        // A line may hold several statements separated by ';'
        String[] querys = aDataRow.split(";");
        Collections.addAll(querysParaExecutar, querys);
    }
}
myReader.close();

This works for "ISO-8859-1", and it works for UTF-8 if I set "UTF-8" as the charset. I need to programmatically detect the charset encoding (UTF-8 or ISO-8859-1) and apply the correct one in my code. Is there a simple way to do that?

There is no foolproof way to determine the character encoding from the encoded data alone. There are heuristic approaches that even a cursory web search should have turned up, but the usual mechanism is to rely on the encoding being specified separately from the content. – John Bollinger
Usually the encoding is specified by whomever inserted the data, or separately from the content. – sdfbhg
That's true; however, the problem is that my users are going to edit the file, and if they edit it with Windows Notepad and save it, it will always end up encoded as "ISO-8859-1". The original file encoding is "UTF-8". – Kevin Giediel
I suggest you look for an encoding-guessing library. Validating UTF-8 is pretty straightforward, but when it comes to Notepad and friends, you will get whatever 8-bit encoding suits the user's localisation – e.g. the Polish ł can't be represented in ISO-8859-1. And for distinguishing different ISO-8859-X encodings, you need statistics of character distributions in different languages. – lenz
@hooknc: US-ASCII is a subset of UTF-8. ISO-8859-1 is a subset in terms of character repertoire, but not in terms of encoded bytes. – dan04
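As the comment above notes, validating UTF-8 is straightforward: a UTF-8 decoder in strict mode rejects any byte sequence that is not well-formed UTF-8, while ISO-8859-1 accepts every byte. A minimal sketch using the JDK's CharsetDecoder (the class name DetectCharset and the guess method are illustrative, not part of the question's code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DetectCharset {
    // Returns UTF-8 if the bytes decode cleanly as strict UTF-8,
    // otherwise falls back to ISO-8859-1 (which accepts any byte).
    static Charset guess(byte[] bytes) {
        try {
            // newDecoder() defaults to CodingErrorAction.REPORT,
            // so malformed input throws instead of being replaced.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        // "maçã" encoded as UTF-8 validates; the same text saved as
        // ISO-8859-1 contains bytes that are malformed UTF-8.
        System.out.println(guess("maçã".getBytes(StandardCharsets.UTF_8)));      // UTF-8
        System.out.println(guess("maçã".getBytes(StandardCharsets.ISO_8859_1))); // ISO-8859-1
    }
}
```

Note the caveat from the comments: this only distinguishes "valid UTF-8" from "everything else"; it cannot tell ISO-8859-1 apart from other 8-bit encodings such as Windows-1252.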

1 Answer


I solved the problem with the juniversalchardet library. It's working as expected.

// Feed the file to the detector to guess its charset
FileInputStream fIn = new FileInputStream(myFile);
byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = fIn.read(buf)) > 0 && !detector.isDone()) {
    detector.handleData(buf, 0, nread);
}
detector.dataEnd();
fIn.close();

// getDetectedCharset() may return null when detection fails
String encoding = detector.getDetectedCharset();
String charsetName;
if ("WINDOWS-1252".equalsIgnoreCase(encoding)) {
    // Notepad's "ANSI" files are detected as WINDOWS-1252; read them as ISO-8859-1
    charsetName = "ISO-8859-1";
} else {
    // UTF-8, or fall back to UTF-8 when nothing was detected
    charsetName = "UTF-8";
}

// Reopen the file: the detector has already consumed the original stream
BufferedReader myReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(myFile), charsetName));