5
votes

While trying to read a UTF-16 encoded file with hints from this answer, I ran into the problem that, after reading a few thousand characters, the getline method starts to read garbage mojibake.

Here is my main:

#include <cstdio>
#include <clocale>
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>

int main(void) {

    std::wifstream wif("test.txt", std::ios::binary);
    setlocale(LC_ALL, "en_US.utf8");
    if (wif.is_open())
    {
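        // Install a UTF-16 facet; consume_header reads the FF FE BOM to pick the byte order.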
        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>
            )
        );

        std::wstring wline;
        while (std::getline(wif, wline))
        {
            std::wcout << wline;
        }

        wif.close();
    } 

    return 0;
}

The test.txt file contains the FF FE byte order mark (UTF-16 LE), followed by 100 lines with 80 'a's in each line. Here is a bash script that generates test.txt on *nix:

#!/bin/bash

echo -n -e \\xFF\\xFE > test.txt
for i in $(seq 1 100)
do
  for j in $(seq 1 80)
  do
    echo -n -e \\x61\\x00 >> test.txt
  done
  echo -n -e \\x0A\\x00 >> test.txt
done

Here is how I compile and run it:

g++-8 -std=c++17 -g main.cpp -o m && ./m

What I expected: 8000 'a's are printed.

What actually happened:

After printing a few thousand 'a's, the output changes to the following garbage:

aaaaaaaaaa愀愀愀愀愀愀愀愀愀愀

and occasionally non-printable characters that look like 0A00 in a rectangle.

The 愀 character is code point U+6100 (binary 0110000100000000), so it looks like an 'a' byte (0x61) followed by a zero byte (0x00).

It seems as if some bytes are lost during reading, and from then on everything is misaligned and all the remaining symbols are decoded incorrectly. Or, since the output ends with that 0A00 character, it might be that the endianness gets flipped after reading a few thousand 'a's, but that wouldn't make any sense either.

Why does this happen, and what's the easiest way to fix it?

1
Note that std::codecvt_utf16 has been deprecated in the C++17 standard. – Some programmer dude
@Someprogrammerdude Thanks. What would be the current alternative? Use the deprecated stuff or write it yourself? If that's the case, I'd rather put up with deprecation warnings than rewrite it myself. – Indestruktible
Probably std::codecvt directly. – Some programmer dude
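
For reference, here is a rough sketch of sidestepping the deprecated facet altogether by reading the raw bytes and decoding the UTF-16 code units by hand. This is not from the thread above and is not the std::codecvt route the comment suggests; it assumes a little-endian file, handles only BMP characters, and the function name decode_utf16le is made up for illustration:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Decode a UTF-16 LE file into a wstring, one 16-bit unit per wchar_t.
// Surrogate pairs are not combined, so this only handles BMP text.
std::wstring decode_utf16le(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes(
        (std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());

    std::wstring out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;  // skip the FF FE byte order mark
    for (; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t unit =
            static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}

int main() {
    std::wcout << decode_utf16le("test.txt");
    return 0;
}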

1 Answer

1
vote

A simple workaround (but not a general solution)

If you are sure that the input file will have a particular endianness, then you can simply hardcode the endianness as shown in the example in the documentation:

        wif.imbue(
            std::locale(
                wif.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>
            )
        );

With a hardcoded std::little_endian, the problem seems to disappear, and the file is read correctly. It probably won't work for files with the opposite endianness.
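
If the input's endianness isn't known in advance, one possible extension of this workaround (not part of the original answer, and untested) is to sniff the BOM with a plain byte stream first and then hardcode the matching facet, stripping the decoded U+FEFF manually since consume_header is still avoided:

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // Detect the byte order mark: FF FE = little endian, FE FF = big endian.
    std::ifstream probe("test.txt", std::ios::binary);
    const unsigned char b0 = static_cast<unsigned char>(probe.get());
    const unsigned char b1 = static_cast<unsigned char>(probe.get());
    const bool little = (b0 == 0xFF && b1 == 0xFE);
    probe.close();

    std::wifstream wif("test.txt", std::ios::binary);
    if (!wif.is_open())
        return 1;

    // Hardcode the detected byte order, as in the answer above; consume_header
    // is avoided on purpose, so the BOM decodes to U+FEFF and is stripped below.
    if (little)
        wif.imbue(std::locale(wif.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    else
        wif.imbue(std::locale(wif.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff>));  // default mode is big endian

    std::wstring wline;
    bool first = true;
    while (std::getline(wif, wline)) {
        if (first && !wline.empty() && wline.front() == L'\xFEFF')
            wline.erase(0, 1);  // drop the decoded byte order mark
        first = false;
        std::wcout << wline;
    }
    return 0;
}

Note that this still relies on std::codecvt_utf16, which, as mentioned in the comments, is deprecated since C++17.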