
I'd like to write a Clojure function that takes a string in one encoding and converts it to another, the way the iconv library does.

For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.

So let's say we have iso.txt, which contains our letter and EOL:

$ hexdump iso.txt                               
0000000 e8 0a                  
0000002

Now we can convert it to UTF-8 like this:

$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a                                       
0000003

How should I write something equivalent in Clojure? I'm happy to use any external libraries, but I don't know where to find them. Looking around, I couldn't figure out how to use libiconv itself on the JVM. Is there an alternative?

Edit

After reading Alex's link in the comment, this is so simple and so cool:

user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"

user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"
A point of clarification: strings in Java (and therefore in Clojure) are defined as a sequence of Unicode characters, and therefore always have the same representation. It's only when translating between strings/characters and underlying bytes that encoding comes into play. - Alex
@alex I see, but how would I do it operating on the byte level then? Is there a way to convert the hex value e8 to the string which is the unicode character 'è'? - spike
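
The byte-level conversion discussed in these comments can be sketched as decode-then-encode: build a String from the source bytes using the source charset, then ask that String for its bytes in the target charset. This is only a minimal sketch; `convert-bytes` and `iso-bytes` are hypothetical names chosen for illustration, not part of any library.

```clojure
(defn convert-bytes
  "Hypothetical helper: re-encode byte array `bs` from charset `from` to charset `to`."
  [^bytes bs ^String from ^String to]
  ;; Decode the bytes into a (Unicode) String, then encode that String again.
  (.getBytes (String. bs from) to))

;; 0xe8 is "è" in ISO-8859-1, followed by a newline.
;; unchecked-byte wraps values above 127 into Java's signed byte range.
(def iso-bytes (byte-array [(unchecked-byte 0xe8) (unchecked-byte 0x0a)]))

;; Mask with 0xff to view the signed bytes as unsigned values for printing.
(map #(format "%02x" (bit-and % 0xff))
     (convert-bytes iso-bytes "ISO-8859-1" "UTF-8"))
;; => ("c3" "a8" "0a")
```

The result matches the iconv output in the question. Charset names are passed as plain strings here, so an unsupported name would throw at the point of decoding or encoding.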

1 Answer


If you want a simple whole-file conversion to UTF-8, slurp accepts an :encoding option for the input file, and spit writes UTF-8 by default (it also accepts :encoding if you need a different output charset). This approach reads the entire file into memory, so large files may need a streaming approach instead.

$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a                                          
0000002

(spit "/Users/path/iso2.txt"
      (slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))

$ hexdump iso2.txt
0000000 c3 a8 0a                                       
0000003

Note: slurp will assume UTF-8 if you do not specify an encoding.
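
For files too large to slurp, one possible approach is to stream characters from a reader to a writer, each opened with its own encoding. The following is a rough sketch under that assumption; `convert-file` is a hypothetical helper name, and the file names in the usage form are the ones from the question.

```clojure
(require '[clojure.java.io :as io])

(defn convert-file
  "Hypothetical helper: stream `in-path` (charset `from`) into `out-path` (charset `to`)."
  [in-path from out-path to]
  (with-open [r (io/reader in-path :encoding from)
              w (io/writer out-path :encoding to)]
    ;; io/copy moves characters through a fixed-size buffer,
    ;; so memory use does not grow with the size of the file.
    (io/copy r w)))

(comment
  ;; Usage, assuming iso.txt exists in the current directory:
  (convert-file "iso.txt" "ISO-8859-1" "iso2.txt" "UTF-8"))
```

Both the reader and the writer are closed by with-open, which also flushes the output before the function returns.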