I have a string that is in Windows-1252 encoding, but needs to be converted to UTF-8.
This is for a program that fixes a UTF-8 file that has fields containing Russian text encoded in quoted-printable Windows-1252. Here's the code that decodes the quoted-printable:
(defn reencode
  [line]
  (str/replace line #"=([0-9A-Fa-f]{2})=([0-9A-Fa-f]{2})"
               (fn [match]
                 (apply str
                        (map #(char (Integer/parseInt % 16)) (drop 1 match))))))
Here's the final code:
(defn reencode
  [line]
  (str/replace line #"(=([0-9A-Fa-f]{2}))+"
               (fn [[match ignore]]
                 (String.
                  (byte-array (map
                               #(Integer/parseInt (apply str (drop 1 %)) 16)
                               (partition 3 match)))
                  "Windows-1252"))))
It fixes the encoding by calling (String. ... "Encoding") on each consecutive run of quoted-printable escapes, so every run is decoded as one byte sequence. The original function only decoded escapes in pairs, so it skipped single escapes such as =3D, which is the quoted-printable escape for =.
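For anyone adapting this, here is a minimal sketch of the same run-based approach with the charset passed in as an argument. The name decode-qp-runs and the charset parameter are my own for illustration and are not part of the code above. It assumes the input contains only =XX escapes (no soft line breaks) and relies on the Java String(byte[], String) constructor.

```clojure
(require '[clojure.string :as str])

(defn decode-qp-runs
  "Decode every run of consecutive =XX escapes in `line` as bytes in `charset`.
  Hypothetical helper: the name and the `charset` parameter are illustrative."
  [line charset]
  ;; The group is non-capturing, so the replacement fn receives the whole
  ;; run as a plain string, e.g. \"=D0=AF\".
  (str/replace line #"(?:=[0-9A-Fa-f]{2})+"
               (fn [run]
                 (String.
                  (byte-array
                   ;; Each escape is \"=XX\"; drop the \"=\" and parse XX as hex.
                   ;; unchecked-byte wraps values above 127 into the signed range.
                   (map #(unchecked-byte (Integer/parseInt (subs % 1) 16))
                        (re-seq #"=[0-9A-Fa-f]{2}" run)))
                  ^String charset))))

;; A single escape is handled, unlike the pair-based version:
(decode-qp-runs "a=3Db" "Windows-1252") ;=> "a=b"
```

Taking the charset as an argument makes it easy to try a different one (for example "UTF-8") on the same run of bytes if the output still looks wrong.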
2014-07-15 13:36:26 SMS\n=D0=AF =D0=B4=D0=B5=D0=BB=D0=B0=D1=8E, which, when converted using the above code, is: 2014-07-15 13:36:26 SMS\nЯ делаÑ, rather than the expected 2014-07-15 13:36:26 SMS\nЯ делаю. - Zaz