Wonder if anyone could help on this issue?

As we know .txt files encoded with UTF-8 and Unicode(UTF-16) have hidden characters.

I am writing a program that takes a selected .txt file, which may be encoded as UTF-8 or Unicode (UTF-16). I need to get the first character of the file and store it. I then want to put that character into a separate string and use std::stoi to get the int value of the hidden character.

    //OPEN THE FILE IN BINARY
    std::fstream mazeFile(mazeFileLoc, std::ios::in | std::ios::binary);

    if (mazeFile.is_open())
    {
        //STORE THE FIRST CHARACTER AS A CHAR VALUE
        char test = mazeFile.get();
        std::cout << "First Character is : " << test << std::endl;

        //PUT THE CHAR VALUE IN A STRING
        std::string strTest;
        strTest.insert(strTest.begin(), test);
        std::cout << "String First Character is : " << strTest << std::endl;

        //USE STOI TO GET THE INT VALUE OF STRING
        int testIntVal = std::stoi(strTest);
        std::cout << "Int Value of first character is : " << testIntVal << std::endl;

        mazeFile.close();
    }

The issue I am having is that the program raises an error at run time when it reaches the std::stoi call.

Does anyone know why this fails instead of converting the character?

Git Link : https://github.com/xSwalshx/ANN.git

"As we know .txt files encoded with UTF-8 and Unicode(UTF-16) have hidden characters.": What does this mean? Are you referring to a BOM (which is optional and variously recommended and not recommended)? - Tom Blodget

1 Answer

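std::stoi throws std::invalid_argument when the string does not start with a number, and the first byte of a BOM (0xEF for UTF-8, 0xFF or 0xFE for UTF-16) is not a digit. That is the run-time error you are seeing. Whenever the input is not guaranteed to be numeric, std::stoi needs exception handling as follows: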

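//needs <string>, <stdexcept> and <iostream>
int testIntVal = 0;
try
{
    testIntVal = std::stoi(strTest);
    std::cout << "Int Value of first character is : " << testIntVal << std::endl;
}
catch(const std::invalid_argument&)
{
    //thrown when no conversion could be performed
    std::cout << "not a valid integer\n";
}
catch(const std::out_of_range&)
{
    //thrown when the number does not fit in an int
    std::cout << "integer out of range\n";
}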
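
If what you actually want is the numeric value of that first byte, std::stoi is the wrong tool: it parses decimal digits, it does not return a character code. Here is a minimal sketch, assuming a hypothetical helper named first_byte_value (my name, not something from your code). It relies on the fact that get() already returns an int in the range 0..255, or EOF.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the question's code): returns the first byte
// of the file as a value in 0..255, or -1 if the file cannot be opened or is empty.
int first_byte_value(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        return -1;

    // get() returns an int: the byte as an unsigned value, or EOF at end of file.
    const int c = in.get();
    if (c == std::char_traits<char>::eof())
        return -1;
    return c;
}
```

For a UTF-8 file that starts with a BOM this returns 239 (0xEF). If you have already read the byte into a char, static_cast<unsigned char>(test) gives you the same value.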

This is not the right way to check the file's encoding.

You have to check for a BOM (Byte Order Mark). If the file has a BOM, you can be certain of the format.

If the file doesn't have a BOM, you have to guess the format; you can't be sure. If a text viewer shows the content as "123", it is stored as

0x31 0x32 0x33 //in UTF8 (same as ASCII for these characters)
0x31 0x00 0x32 0x00 0x33 0x00 //in UTF16 little-endian
0x00 0x31 0x00 0x32 0x00 0x33 //in UTF16 big-endian

Note that for ASCII characters UTF16-LE has a zero in every second byte (offsets 1, 3, 5, ...), UTF16-BE has a zero in every first byte (offsets 0, 2, 4, ...), and UTF8 has no zeros at all. You can start with the weak assumption that the file contains ASCII characters only, then take a guess at the encoding. See the example below.

To make things easier, you should use UTF8 to store text. On Windows, convert UTF16 to UTF8 before writing the file, then read UTF8 and convert it back to UTF16 after reading. This will be compatible with other systems as well.

#include <cstdio>   //printf
#include <cstring>  //memcmp
#include <fstream>  //std::ifstream

const int FORMAT_UTF8 = 0;
const int FORMAT_UTF16 = 1;
const int FORMAT_UTF16BE = 2;

int get_file_encoding(const char* filename)
{
    printf("filename: %s ", filename);
    //read up to 100 bytes; buf stays zero-filled past what was actually read
    unsigned char buf[100] = { 0 };
    std::ifstream fin(filename, std::ios::binary);
    fin.read((char*)buf, sizeof(buf));
    int size = static_cast<int>(fin.gcount());

    //check for BOM
    if(size >= 3 && memcmp(buf, "\xef\xbb\xbf", 3) == 0)
    {
        printf("UTF8\n");
        return FORMAT_UTF8;
    }

    if(size >= 2 && memcmp(buf, "\xff\xfe", 2) == 0)
    {
        printf("UTF16\n");
        return FORMAT_UTF16;
    }

    if(size >= 2 && memcmp(buf, "\xfe\xff", 2) == 0)
    {
        printf("UTF16 big endian\n");
        return FORMAT_UTF16BE;
    }

    //BOM not found, guess from where the zero bytes fall (assumes mostly ASCII text)
    for(int i = 0; i < size - 1; i += 2)
    {
        if(buf[i + 1] == 0)
        {
            printf("assume UTF16\n");
            return FORMAT_UTF16;
        }

        if(buf[i] == 0)
        {
            printf("assume UTF16 big endian\n");
            return FORMAT_UTF16BE;
        }
    }

    printf("Assume ASCII or UTF8\n");
    return FORMAT_UTF8;
}
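
For the UTF16 to UTF8 conversion recommended above, Windows provides the WideCharToMultiByte API. If you would rather see what the conversion involves, or need it on another platform, here is a rough portable sketch. The names append_utf8 and utf16_bytes_to_utf8 are my own, and replacing malformed input with U+FFFD is an assumption about what you want, not a requirement.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one Unicode code point to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts raw UTF-16 bytes (BOM already removed) to UTF-8.
// `big_endian` selects the byte order. Unpaired surrogates and a dangling odd
// byte are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_bytes_to_utf8(const std::string& bytes, bool big_endian)
{
    std::string out;
    const std::size_t n = bytes.size();

    // Reads the 16-bit code unit that starts at byte offset i.
    auto unit_at = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t a = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t b = static_cast<unsigned char>(bytes[i + 1]);
        return big_endian ? (a << 8) | b : (b << 8) | a;
    };

    std::size_t i = 0;
    while (i + 1 < n) {
        std::uint32_t cp = unit_at(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            std::uint32_t lo = 0;
            if (i + 1 < n)
                lo = unit_at(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }
        append_utf8(out, cp);
    }
    if (i < n)
        append_utf8(out, 0xFFFD); // dangling odd byte at the end
    return out;
}
```

A caller would read the whole file in binary mode, use the detected format to choose big_endian, skip the 2-byte BOM if one was found, and pass the remaining bytes to this function.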