In Python 2.7 I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (result is "abc\xed\xb0\xb4xyz"). But when I pass the UTF-8 string to e.g. pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the "unpaired surrogate" in the string and (correctly?) reject it.
Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
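For reference, the Python 3 behaviour described above can be reproduced with a minimal sketch like the following. The variable names are only illustrative, and this is Python 3 only, so it does not solve the Python 2 case:

```python
# Python 3 only: strict vs. "replace" encoding of a lone surrogate.
text = "abc\udc34xyz"  # U+DC34 is an unpaired low surrogate

try:
    text.encode("utf-8")  # the strict (default) error handler refuses surrogates
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc.reason  # the "surrogates not allowed" part of the message

# With the "replace" handler the encoder substitutes "?" for the surrogate
# (not U+FFFD), which still yields valid UTF-8.
replaced = text.encode("utf-8", "replace")

print(strict_error)
print(replaced)
```

Note that the replacement character used by the encoder here is a plain question mark rather than U+FFFD.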
So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.
By the way, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.