10
votes

I need to convert an ISO-8859-1 file to UTF-8 encoding without losing any content...

I have a file that looks like this:

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>

Now I want to encode it as UTF-8. I tried the following:

f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
ts=new String(f.getBytes("UTF-8"), "UTF-8")
g=new File('c:/temp/myutf8.xml').write(ts)

That didn't work due to String incompatibilities. Then I read something about byte stream readers/writers, StreamingMarkupBuilder, and others...

Then I tried:

f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
mb = new groovy.xml.StreamingMarkupBuilder()
mb.encoding = "UTF-8"

new OutputStreamWriter(new FileOutputStream('c:/temp/myutf8.xml'),'utf-8') << mb.bind {
    mkp.xmlDeclaration()
    out << f
}

This was not at all what I wanted.

I just want to read the content of an XML file with an ISO-8859-1 reader and then write it into a new (or the old) file. Why is this so complicated? :-/

The result should just be the following, and the file should really be encoded in UTF-8:

<?xml version="1.0" encoding="UTF-8" ?> 
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>

Thanks for any answers. Cheers

2
I haven't got the first idea about Groovy, but I assume that if you specify the encoding of the file for File.getText, it will be converted from that encoding to your internal encoding automatically. I.e. you probably don't need to do anything else as long as your internal encoding is set to use UTF-8. Somebody correct me if I'm off-track here. Alternatively, what are the exact errors you get? - deceze♦

2 Answers

14
votes
def f=new File('c:/data/myiso88591.xml').getText('ISO-8859-1')
new File('c:/data/myutf8.xml').write(f,'utf-8')

(I just gave it a try, it works :-)

It's the same as in Java: the libraries do the conversion for you. As deceze said, when you specify an encoding on read, the bytes are decoded into the internal string format (UTF-16, as far as I know). When you specify another encoding when you write the string, it is encoded into that encoding.
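Since this is the same as in Java, here is a minimal Java sketch of the two steps on raw bytes: decode with the source charset, then encode with the target charset. The class and method names (`TranscodeDemo`, `latin1ToUtf8`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class TranscodeDemo {

    // Step 1: decode ISO-8859-1 bytes into a String (the JVM's internal form).
    // Step 2: encode that String as UTF-8 bytes.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The characters U+00DC, U+00F6, U+00DF as single ISO-8859-1 bytes.
        byte[] latin1 = {(byte) 0xDC, (byte) 0xF6, (byte) 0xDF};
        byte[] utf8 = latin1ToUtf8(latin1);
        // Each of these characters needs two bytes in UTF-8.
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

The string in the middle carries no encoding of its own as far as your code is concerned; the charset only matters at the two boundaries where bytes are read or written.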

But if you work with XML, you shouldn't have to worry about the encoding anyway, because the XML parser will take care of it. It reads the first characters, <?xml, and determines the basic encoding from them. After that, it can read the encoding information from your XML header and use it.
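One caveat: a plain text copy does not touch the XML declaration, so the output would still say encoding="ISO-8859-1" even though the bytes are UTF-8. A possible way around that, sketched here in Java with the JDK's built-in parser and serializer, is to let the parser pick up the declared encoding and then serialize with UTF-8 as the output encoding. The class and method names (`XmlReencode`, `toUtf8Xml`) are made up for illustration, and the exact layout of the generated declaration depends on the serializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class, for illustration only.
public class XmlReencode {

    // Parse XML bytes (the parser reads the encoding from the declaration)
    // and serialize the document again with UTF-8 as the output encoding.
    static byte[] toUtf8Xml(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // The sample document from the question; non-ASCII characters are
        // written as Unicode escapes so the source file encoding does not matter.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>"
                + "<HelloEncodingWorld>\u00DC\u00F6\u00E4\u00FC\u00DF Test!!!</HelloEncodingWorld>";
        byte[] utf8 = toUtf8Xml(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

This loads the whole document into memory as a DOM, so it only suits files of moderate size.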

11
votes

Making it a little more Groovy, and not requiring the whole file to fit in memory, you can use the readers and writers to stream the file. This was my solution when I had files too big for plain old Unix iconv(1).

new FileOutputStream('out.txt').withWriter('UTF-8') { writer ->
    new FileInputStream('in.txt').withReader('ISO-8859-1') { reader ->
        writer << reader
    }
}
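
For comparison, a rough plain-Java counterpart of the streaming idea might look like the sketch below: it copies through a fixed-size character buffer, so memory use does not grow with the file size. The class and method names (`StreamTranscode`, `transcode`), the buffer size, and the temp files in `main` are assumptions for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
public class StreamTranscode {

    // Copy a text file from one charset to another through a small buffer,
    // so the whole file never has to fit in memory.
    static void transcode(Path in, Charset from, Path out, Charset to) throws IOException {
        try (Reader reader = Files.newBufferedReader(in, from);
             Writer writer = Files.newBufferedWriter(out, to)) {
            char[] buffer = new char[8192]; // arbitrary buffer size
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Temp files stand in for real input and output paths.
        Path in = Files.createTempFile("latin1-", ".txt");
        Path out = Files.createTempFile("utf8-", ".txt");
        Files.write(in, new byte[]{(byte) 0xDC, (byte) 0xF6, (byte) 0xDF});
        transcode(in, StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);
        System.out.println(Files.size(in) + " bytes in, " + Files.size(out) + " bytes out");
    }
}
```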