I went through a few posts, such as "FileReader reads the file as a character stream" and "can be treated as whitespace if the document is handed as a stream of characters", where the answers say the input source is actually a char stream, not a byte stream.

However, the suggested solution from 1 does not seem to apply to UTF-16LE. Although I use this code:

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
    }

I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.

I looked at Files.newInputStream, and it does use a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck. I also checked that there are no extra characters (except the BOM) before the <?xml part.

I also want to mention that this code works just fine with UTF-8.

// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.

// Edit 2: Tried using a buffered reader. Same results: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'

Thanks in advance.

What if you remove the BOM at the beginning (skipping the first two bytes)? ... { is.read(); is.read(); – Joop Eggen
Then I wouldn't be able to read UTF-8 without BOM or ISO-8859-1. :( – Ben
The encoding is given in <?xml encoding=...?>, or defaults to UTF-8. I have heard that in rare cases a BOM caused such a problem, but I do not remember the specifics. – Joop Eggen
I cannot even get that far. I want to read the tag and attribute you refer to, but see my 2nd edit: parsing stops before that. – Ben
I double checked. The file starts with the BOM 0xFF 0xFE. Maybe I need to wrap it in a BOMRemovingInputStream… – Ben

1 Answer

To get a bit further, first gather some information about the file:

    // Needs java.util.regex.Matcher and java.util.regex.Pattern in addition to the imports above.
    // Decode as UTF-16LE and inspect what is actually in the file.
    byte[] bytes = Files.readAllBytes(filename.toPath());
    String xml = new String(bytes, StandardCharsets.UTF_16LE);
    if (xml.startsWith("\uFEFF")) {
        LOG.info("Has BOM and is evidently UTF_16LE");
        xml = xml.substring(1);
    }
    if (!xml.contains("<?xml")) {
        LOG.info("Has no XML declaration");
    }
    // Pull the encoding out of the XML declaration; fall back to the XML default, UTF-8.
    Matcher m = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
    String declaredEncoding = m.find() ? m.group(1) : "UTF-8";
    LOG.info("Declared as " + declaredEncoding);

    // Re-encode the BOM-free text in the declared encoding so bytes and declaration agree.
    try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
    }
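
If the diagnostics confirm a BOM, one possible way to keep UTF-8 without BOM and ISO-8859-1 working is to look at the first bytes before deciding how to feed the parser. The following is only a sketch under assumptions: `BomSniffer`, `charsetFromBom` and `parse` are hypothetical names, it uses the JDK's `DocumentBuilder` instead of the Xerces `DOMParser` so that it is self-contained, and it only knows the UTF-8 and UTF-16 BOMs. When a BOM is found, the bytes are decoded and passed as a character stream, in which case the parser should not need the encoding declaration; without a BOM, the raw bytes are passed so the parser can read the declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration; not part of Xerces or the JDK. */
final class BomSniffer {

    /** Returns the charset implied by a leading BOM, or null if none is recognised. */
    static Charset charsetFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // Limitation: a UTF-32LE BOM (FF FE 00 00) would also match this branch.
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Parses XML bytes, decoding them up front only when a BOM is present. */
    static Document parse(byte[] bytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Charset charset = charsetFromBom(bytes);
            if (charset == null) {
                // No BOM: pass raw bytes so the parser honours the encoding declaration.
                return builder.parse(new InputSource(new ByteArrayInputStream(bytes)));
            }
            // These charsets keep the BOM as U+FEFF when decoding, so drop it by hand.
            String xml = new String(bytes, charset);
            if (xml.startsWith("\uFEFF")) {
                xml = xml.substring(1);
            }
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
    }
}
```

With the code from the question, this would be called as `BomSniffer.parse(Files.readAllBytes(filename.toPath()))`. Reading the whole file into memory is a simplification; for large files a `PushbackInputStream` that peeks at the first three bytes would be the streaming equivalent.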