While trying to read a UTF-16 encoded file, following the hints from this answer, I ran into a problem: after reading a few thousand characters, getline starts to return mojibake.
Here is my main:
#include <cstdio>
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>
int main(void) {
    std::wifstream wif("test.txt", std::ios::binary);
    setlocale(LC_ALL, "en_US.utf8");
    if (wif.is_open())
    {
        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>
            )
        );
        std::wstring wline;
        while (std::getline(wif, wline))
        {
            std::wcout << wline;
        }
        wif.close();
    }
    return 0;
}
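Since the problem is getline going off the rails mid-file, one way to rule out the wifstream/codecvt_utf16 machinery entirely is to read the raw bytes in binary mode and decode UTF-16LE by hand. This is only a sketch: decode_utf16le and read_utf16le_file are made-up helper names, and it assumes a little-endian file containing only BMP characters (no surrogate-pair handling):

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Decode a UTF-16LE byte buffer into a std::wstring (BMP only; no
// surrogate-pair recombination). Assumes wchar_t is at least 16 bits.
std::wstring decode_utf16le(const std::vector<unsigned char>& bytes) {
    std::wstring out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;  // skip the little-endian BOM
    for (; i + 1 < bytes.size(); i += 2) {
        // Low byte first: this is what "little-endian" means here.
        std::uint16_t unit = static_cast<std::uint16_t>(
            bytes[i] | (bytes[i + 1] << 8));
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}

// Slurp the whole file in binary mode, then decode in one pass.
std::wstring read_utf16le_file(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes(
        (std::istreambuf_iterator<char>(in)),
        std::istreambuf_iterator<char>());
    return decode_utf16le(bytes);
}
```

Splitting the decoded wstring on L'\n' then reproduces the getline loop, with no codecvt facet anywhere in the read path.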
The test.txt file contains an FF FE byte order mark, followed by 100 lines with 80 'a's in each line. Here is a bash script that generates test.txt on *nix:
#!/bin/bash
echo -n -e \\xFF\\xFE > test.txt
for i in $(seq 1 100)
do
    for j in $(seq 1 80)
    do
        echo -n -e \\x61\\x00 >> test.txt
    done
    echo -n -e \\x0A\\x00 >> test.txt
done
Here is how I compile and run the main:
g++-8 -std=c++17 -g main.cpp -o m && ./m
What I expected: 8000 'a's are printed.
What actually happened: after printing a few thousand 'a's, the output changes to garbage like
aaaaaaaaaa愀愀愀愀愀愀愀愀愀愀
occasionally interspersed with a non-printable character rendered as 0A00 in a rectangle.
The 愀 character is U+6100 (binary 0110 0001 0000 0000), so it looks like an 'a' byte (0x61) followed by a zero byte.
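That reading can be checked directly: combining the byte pair 0x61 0x00 with the wrong byte order yields exactly U+6100. A minimal sketch (the helper names are made up for illustration):

```cpp
#include <cstdint>

// Combine two consecutive bytes into a UTF-16 code unit, under each
// of the two possible byte orders.
std::uint16_t as_little_endian(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>(first | (second << 8));
}

std::uint16_t as_big_endian(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>((first << 8) | second);
}
```

as_little_endian(0x61, 0x00) gives 0x0061 ('a'), while as_big_endian(0x61, 0x00) gives 0x6100 (愀); the same swap turns the 0x0A 0x00 newline pair into the 0A00 box seen in the output.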
It seems as if some bytes are lost during reading, and from then on everything is misaligned and all the remaining code units are decoded incorrectly. Or, because the output ends with a 0A00 character, it might be that the endianness is reversed after reading a few thousand 'a's, but that behavior wouldn't make any sense either.
Why does this happen, and what's the easiest way to fix it?
Comments:
std::codecvt_utf16 has been deprecated in the C++17 standard. – Some programmer dude
… std::codecvt directly. – Some programmer dude