4 votes

I've run into a problem when trying to parse a JSON string that I read from a file. My problem is that the zero-width no-break space character (Unicode U+FEFF) is at the beginning of the string when I read it in, and I cannot get rid of it. I don't want to use regex because there may be other hidden characters with different code points.

Here's what I have:

StringBuilder content = new StringBuilder();
try {
    BufferedReader br = new BufferedReader(new FileReader("src/test/resources/getStuff.json"));
    String currentLine;
    while ((currentLine = br.readLine()) != null) {
        content.append(currentLine);
    }
    br.close();
} catch (Exception e) {
    Assert.fail();
}

And this is the start of the JSON file (it's too long to copy-paste the whole thing, but I have confirmed it is valid):

{"result":{"data":{"request":{"year":null,"timestamp":1413398641246,...

Here's what I've tried so far:

  • Copying the JSON file into Notepad++ and showing all characters
  • Copying the file into Notepad++ and converting it to UTF-8 without BOM, and to ISO 8859-1
  • Opening the JSON file in other text editors such as Sublime Text and saving it as UTF-8
  • Copying the JSON file to a .txt file and reading that in
  • Using Scanner instead of BufferedReader
  • In IntelliJ, View -> Active Editor -> Show Whitespaces

How can I read this file in without having the Zero width no-break space character at the beginning of the string?


1 Answer

4 votes

0xEF 0xBB 0xBF is the UTF-8 BOM, 0xFE 0xFF is the UTF-16BE BOM, and 0xFF 0xFE is the UTF-16LE BOM. If U+FEFF shows up at the front of your String, it means the file was saved as UTF-encoded text with a BOM. With UTF-16, a decoder that does not strip the BOM passes it straight through as U+FEFF; with UTF-8, the three BOM bytes simply decode to U+FEFF, because Java's UTF-8 decoder does not treat them specially. In fact, Java is known not to handle UTF-8 BOMs (see bugs JDK-4508058 and JDK-6378911).
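
If you want to confirm which BOM is actually in the file, dumping the first few raw bytes makes it obvious. A minimal sketch, reusing the path from the question:

import java.io.FileInputStream;
import java.io.IOException;

public class BomCheck {
    public static void main(String[] args) throws IOException {
        // Print the first few raw bytes; a UTF-8 BOM shows up as 0xEF 0xBB 0xBF,
        // a UTF-16BE BOM as 0xFE 0xFF, and a UTF-16LE BOM as 0xFF 0xFE.
        try (FileInputStream in = new FileInputStream("src/test/resources/getStuff.json")) {
            byte[] head = new byte[4];
            int read = in.read(head);
            for (int i = 0; i < read; i++) {
                System.out.printf("0x%02X ", head[i]);
            }
            System.out.println();
        }
    }
}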

If you read the FileReader documentation, it says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
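
For completeness, constructing the reader with an explicit charset as the docs suggest looks like the sketch below (reusing the path from the question); note that picking the encoding by hand does not, on its own, skip a BOM.

// Requires java.io.* and java.nio.charset.StandardCharsets.
BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("src/test/resources/getStuff.json"),
                StandardCharsets.UTF_8));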

You need to read the file content using a charset-aware reader, preferably one that will read the BOM for you and adjust itself accordingly. Worst case, you could open the file yourself, read the first few bytes to check whether a BOM is present, and then construct a reader with the appropriate charset for the rest of the file (a minimal sketch of that manual approach follows the example below). Here is an example using org.apache.commons.io.input.BOMInputStream that does exactly that:

(from https://stackoverflow.com/a/13988345/65863)

// Requires Apache Commons IO: org.apache.commons.io.ByteOrderMark
// and org.apache.commons.io.input.BOMInputStream.
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    // By default BOMInputStream detects and skips only the UTF-8 BOM;
    // pass ByteOrderMark.UTF_16BE / ByteOrderMark.UTF_16LE to handle those too.
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // use reader; the BOM (if any) has already been consumed
} finally {
    inputStream.close();
}
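
And here is a minimal sketch of the manual approach mentioned above, without Commons IO. It only handles the UTF-8 BOM, and the class and method names are just illustrative:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class BomAwareReader {
    // Detect and skip a UTF-8 BOM by hand, then hand the remaining bytes
    // to an InputStreamReader with an explicit charset.
    public static BufferedReader open(File file) throws IOException {
        PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 3);
        byte[] head = new byte[3];
        int read = in.read(head, 0, 3);
        boolean hasUtf8Bom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasUtf8Bom && read > 0) {
            in.unread(head, 0, read); // not a BOM, push the bytes back
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}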