
Why is an iso-8859-1 character not converted to utf-8 in the output file when setting output encoding to utf-8?

I have an xml input file in iso-8859-1 encoding, and the encoding is declared. I want to output it in utf-8. My understanding is that setting the output encoding in the xslt file should manage the character conversion.

Is my understanding wrong? If not, why does the following simple test case output an iso-8859-1 character in a utf-8 declared output file?

My input file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<data>ö</data>

My transform looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
    <xsl:output encoding="UTF-8" />
    <xsl:template match="/">
        <result>
            <xsl:value-of select="." />
        </result>
    </xsl:template>
</xsl:stylesheet>

Using saxon9he from the command line, my result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<result>ö</result>

The ö in my result file is 0xF6 according to BabelPad, which is an invalid utf-8 character. The ö seems to be untouched by the transformation.
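One way to check this independently of any editor is to read the bytes straight from disk and run them through a strict UTF-8 decoder. The following is only a minimal sketch: the class name `RawBytes` and its helper methods are made up for illustration, and the file path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Saxon or any other library.
public class RawBytes {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    // A fresh CharsetDecoder reports malformed input by throwing.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RawBytes <file>");
            return;
        }
        // Reads exactly what is stored on disk, with no decoding step in between.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hex(bytes));
        System.out.println("valid UTF-8: " + isValidUtf8(bytes));
    }
}
```

A lone F6 byte is rejected by the decoder, whereas the two-byte sequence C3 B6 (the UTF-8 encoding of ö) is accepted, so running this on the result file shows which of the two is actually stored.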

Thanks for any help!

I assume you're using a library to process the XSLT transformations. Providing that library, and the code that interfaces with it, might be useful. Perhaps it's a setting on the library. - Jazzepi
What makes you think that ö isn't a valid UTF-8 character? - Mads Hansen
@MadsHansen: you're falling victim to a level error. UTF-8 isn't a set of characters but an encoding for the characters in the Universal Character Set (UCS) defined by Unicode and ISO 10646. The OP isn't saying ö is not a legal character but that xF6 is not a legal UTF-8 encoding of that or any character. In this, the OP is entirely correct. - C. M. Sperberg-McQueen
Fwiw, I would expect any XSLT processor to behave as you say you expect. And given that you're using Saxon, my first instinct is to ask: are you sure that neither BabelPad nor anything else is messing with the character encoding after Saxon emits it? I'm not familiar with BabelPad - are you sure you're interpreting what it tells you correctly? What does hexdump say? (When I run Saxon HE on your input, hexdump tells me the ö is F6 in the input and C3 B6 in the output.) - C. M. Sperberg-McQueen
Please just use a browser (like Google Chrome) to open the fresh xml file. Do you get an xml error? If BabelPad is right, you should see "error on line 2 at column 9: Encoding error". - Esailija

1 Answer


I can see two possible explanations (though there are probably others).

(a) the final stage of serialization, that is, converting characters to bytes, is not being done by the XSLT processor but by some other piece of software that does not have access to the stylesheet. This would happen, for example, if you run the transformation in a Java application that sends the output to a Writer rather than an OutputStream - the Writer would convert characters to bytes using the platform default encoding, which is probably iso-8859-1.

(b) the octets you are seeing in your display are not the octets stored on disk, but some transformation of them. This can happen when you load a file into an editor and then ask for a hex display; in some cases you will get a hex display of the editor's in-memory representation of the document, not of what is stored on disk.
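Explanation (a) can be reproduced with the standard JAXP API. The sketch below is an illustration under assumptions, not a description of what the asker's setup actually does: the class name `EncodingDemo` and its helpers are invented, it uses whichever `TransformerFactory` is found on the classpath (the JDK's built-in processor unless Saxon is present), and an explicit ISO-8859-1 `OutputStreamWriter` stands in for a Writer that uses an ISO-8859-1 platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; only the javax.xml.transform calls are real API.
public class EncodingDemo {

    // The stylesheet from the question, collapsed onto fewer lines.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"/\">"
        + "<result><xsl:value-of select=\".\"/></result>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // The input document as ISO-8859-1 bytes, so the o-umlaut is the single byte F6.
    static final byte[] INPUT =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><data>\u00F6</data>"
            .getBytes(StandardCharsets.ISO_8859_1);

    // Runs the transformation and returns the bytes that reach the sink.
    static byte[] transform(boolean throughWriter) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StreamSource in = new StreamSource(new ByteArrayInputStream(INPUT));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            if (throughWriter) {
                // The processor hands over characters; this Writer, not the
                // stylesheet, decides how they become bytes.
                Writer w = new OutputStreamWriter(sink, StandardCharsets.ISO_8859_1);
                t.transform(in, new StreamResult(w));
                w.flush();
            } else {
                // The processor encodes to bytes itself, honouring xsl:output.
                t.transform(in, new StreamResult(sink));
            }
            return sink.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("OutputStream: " + hex(transform(false)));
        System.out.println("Writer:       " + hex(transform(true)));
    }
}
```

With an OutputStream the ö should appear as C3 B6. With the ISO-8859-1 Writer it typically appears as a single F6, even though the XML declaration in the output still says UTF-8, which matches the symptom described in the question.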