9
votes

I'm reading gzip-compressed files using zlib. With zlib you open a file using

gzFile gzopen(const char *filepath, const char *mode);

How do you handle Unicode file paths that are stored as const wchar_t* on Windows?

On UNIX-like platforms you can just convert the file path to UTF-8 and call gzopen(), but that will not work on Windows.

5
Not sure, but I'd expect it to accept UTF-8, so that you can convert your UTF-16 into UTF-8 and pass the result as char*. (sharptooth)
Have you tried using wcstombs or iconv? (Appleman1234)
@Appleman: On Windows wcstombs will, at least by default, convert the string to Windows-1252. Characters that cannot be represented in Windows-1252 will be replaced by various substitution characters. If that happens, the converted string cannot be used as a file path. (Johan Råde)
@Hans Passant: I'm writing a library whose interface takes a file path as a boost::filesystem::path and whose implementation may read the file using the zlib library. Then this is an issue. (Johan Råde)
@Hans Passant: I read that, but I did not understand what you mean. I can create files on my Windows computer with names such as "黒死.txt". And I can open that file by passing its name (as a UTF-16 encoded wide string) to _wfopen(...). (Johan Råde)

5 Answers

14
votes

The next version of zlib will include this function where _WIN32 is #defined:

gzFile gzopen_w(const wchar_t *path, const char *mode);

It works exactly like gzopen(), except it uses _wopen() instead of open().

I purposely did not duplicate the second argument of _wfopen(), and as a result I did not call it _wgzopen() to avoid possible confusion with that function's arguments. Hence the name gzopen_w(). That also avoids the use of the C-reserved name space.

12
votes

First of all, what is a filename?

On Unix-like systems

A filename is a sequence of bytes terminated by zero. The kernel doesn't need to care about character encoding (except to know the ASCII code for /).

However, it's more convenient from the users' point of view to interpret filenames as sequences of characters, and this is done by a character encoding specified as part of the locale. Unicode is supported by making UTF-8 locales available.

In C programs, files are represented with ordinary char* strings in functions like fopen. There is no wide-character version of the POSIX API. If you have a wchar_t* filename, you must explicitly convert it to a char*.
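The explicit conversion mentioned above can be sketched with the standard wcstombs function. This is a minimal sketch, not a definitive implementation: wide_to_locale and fopen_wide are hypothetical names, and the code assumes that calling wcstombs with a NULL destination returns the required length, which is the behaviour POSIX and the Microsoft runtime document.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a wide path to a newly allocated multibyte
 * string in the current locale's encoding. Returns NULL if a character
 * cannot be represented or if allocation fails. The caller frees the result. */
char *wide_to_locale(const wchar_t *wpath)
{
    size_t n = wcstombs(NULL, wpath, 0);   /* length without the terminator */
    if (n == (size_t)-1)
        return NULL;                       /* unrepresentable character */

    char *path = malloc(n + 1);
    if (path == NULL)
        return NULL;

    wcstombs(path, wpath, n + 1);          /* converts and writes the terminator */
    return path;
}

/* Hypothetical wrapper: fopen() for a wide-character path. */
FILE *fopen_wide(const wchar_t *wpath, const char *mode)
{
    char *path = wide_to_locale(wpath);
    if (path == NULL)
        return NULL;

    FILE *f = fopen(path, mode);
    free(path);
    return f;
}
```

The result depends on the active locale, so a program would normally call setlocale(LC_ALL, "") once at startup to pick up the user's (ideally UTF-8) locale before converting any paths.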

On Windows NT

A filename is a sequence of UTF-16 code units. In fact, all string manipulation in Windows is done in UTF-16 internally.

All of Microsoft's C(++) libraries, including the Visual C++ runtime library, use the convention that char* strings are in the locale-specific legacy "ANSI" code page, and wchar_t* strings are in UTF-16. And the char* functions are just backwards-compatibility wrappers around the new wchar_t* functions.

So, if you call MessageBoxA(hwnd, text, caption, type), that's essentially the same as calling MessageBoxW(hwnd, ToUTF16(text), ToUTF16(caption), type). And when you call fopen(filename, mode), that's like _wfopen(ToUTF16(filename), ToUTF16(mode)).

Note that _wfopen is one of many non-standard C functions for working with wchar_t* strings. And this isn't just for convenience; you can't use the standard char* equivalents because they limit you to the "ANSI" code page (which can't be UTF-8). For example, in a windows-1252 locale, you can't (easily) fopen the file שלום.c, because there's just no way to represent those characters in a narrow string.

In cross-platform libraries

Some typical approaches are:

  1. Use Standard C functions with char* strings, and just don't give a 💩 about support for non-ANSI characters on Windows.
  2. Use char* strings but interpret them as UTF-8 instead of ANSI. On Windows, write wrapper functions that take UTF-8 arguments, convert them to UTF-16, and call functions like _wfopen.
  3. Use wide character strings everywhere, which is like #2 except that you need to write wrapper functions for non-Windows systems.
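Approach #2 might look roughly like the following. It is only a sketch: fopen_utf8 and utf8_to_utf16 are hypothetical names, and the non-Windows branch assumes that the system accepts UTF-8 bytes as a path unchanged.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Converts a UTF-8 string to a newly allocated UTF-16 string.
 * Returns NULL on invalid input or allocation failure. The caller frees it. */
static wchar_t *utf8_to_utf16(const char *s)
{
    /* With a length of -1 the returned count includes the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0)
        return NULL;

    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: like fopen(), but the path is always UTF-8. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    FILE *f = NULL;
    wchar_t *wpath = utf8_to_utf16(path);
    wchar_t *wmode = utf8_to_utf16(mode);

    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);

    free(wpath);
    free(wmode);
    return f;
#else
    /* Assumption: paths are plain byte strings here, so UTF-8 passes through. */
    return fopen(path, mode);
#endif
}
```

Callers then use one char* interface on every platform, and the UTF-16 conversion stays confined to the Windows branch of the wrapper.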

How does zlib handle filenames?

Unfortunately, it appears to use the naïve approach #1 above, with open (rather than _wopen) used directly.

How can you work around it?

Besides the solutions already mentioned (my favorite of which is Appleman1234's gzdopen suggestion), you could take advantage of symbolic links to give the file an alternative all-ASCII name which you could then safely pass to gzopen. You might not even have to do that if the file already has a suitable short name.

4
votes

You have the following options:

  1. Patch zlib so that it uses _wfopen on Windows rather than fopen, using something similar to the following in zutil.h:

 #ifdef _WIN32
 #define F_OPEN(name, mode) _wfopen((name), (mode))
 #endif

  2. Use _wfopen or _wopen instead of gzopen, and pass the return value to gzdopen.

  3. Use libiconv or some other library to convert the file name from your given Unicode encoding to ASCII, and pass the ASCII string to gzopen. If the conversion fails, handle the error and prompt the user to rename the file.

For more information regarding iconv, see An example of iconv. That example uses Japanese to UTF-8, but it wouldn't be a large leap to change the destination encoding to ASCII or ISO 8859-1.
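When ASCII is the only acceptable target, the conversion in option 3 reduces to a range check, so it can be sketched without libiconv. This is a simplified stand-in rather than the iconv approach itself, and wide_to_ascii is a hypothetical name.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns a newly allocated ASCII copy of wpath,
 * or NULL if any character lies outside ASCII or allocation fails.
 * The caller frees the result. */
char *wide_to_ascii(const wchar_t *wpath)
{
    size_t len = wcslen(wpath);
    char *path = malloc(len + 1);
    if (path == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        /* The cast keeps the test correct whether wchar_t is signed or not. */
        if ((unsigned long)wpath[i] > 0x7F) {
            free(path);
            return NULL;                   /* not representable in ASCII */
        }
        path[i] = (char)wpath[i];
    }
    path[len] = '\0';
    return path;
}
```

A NULL result is the point at which the caller would report the error and ask the user to rename the file, as described in option 3.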

For more information regarding zlib and non ANSI character conversion see here

3
votes

Here is an implementation of Appleman's option #2. The code has been tested.

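#ifdef _WIN32

#include <io.h>       // _dup, _close
#include <stdio.h>    // _wfopen, _fileno, fclose
#include <stdlib.h>   // wcstombs, malloc, free
#include <zlib.h>

gzFile _wgzopen(const wchar_t* fileName, const wchar_t* mode)
{
    FILE* stream = NULL;
    gzFile gzstream = NULL;
    char* cmode = NULL;         // mode converted to char*
    size_t n = (size_t)-1;      // wcstombs returns (size_t)-1 on failure
    int fd = -1;

    stream = _wfopen(fileName, mode);

    if(stream)
        n = wcstombs(NULL, mode, 0);
    if(n != (size_t)-1)
        cmode = (char*)malloc(n + 1);
    if(cmode) {
        wcstombs(cmode, mode, n + 1);
        // Hand zlib a duplicate descriptor so that the FILE object can be
        // closed below instead of being leaked.
        fd = _dup(_fileno(stream));
        if(fd != -1)
            gzstream = gzdopen(fd, cmode);
        if(fd != -1 && !gzstream)
            _close(fd);
    }

    free(cmode);
    if(stream) fclose(stream);  // zlib keeps its own descriptor, if it got one
    return gzstream;
}

#endif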

I have made both filename and mode const wchar_t* for consistency with Windows functions such as

FILE* _wfopen(const wchar_t* filename, const wchar_t* mode);

1
votes

Here is my own version of the Unicode helper function, tested slightly more thoroughly than the version above.

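#include <ctype.h>      // tolower
#include <fcntl.h>      // O_* flags
#include <io.h>         // _wopen, _close
#include <sys/stat.h>   // _S_IREAD, _S_IWRITE
#include <zlib.h>

// Translates an fopen-style mode string into _wopen() flags and permission bits.
static void GetFlags(const char* mode, int& flags, int& pmode)
{
    flags = O_BINARY;   // O_RDONLY is 0; always binary, because text-mode
                        // translation would corrupt compressed data
    pmode = 0;          // pmode needs to be set, otherwise the file gets the read-only attribute, see
                        // http://stackoverflow.com/questions/1412625/why-is-the-read-only-attribute-set-sometimes-for-files-created-by-my-service

    for( const char* p = mode ; *p ; p++ )
    {
        // Cast first: passing a negative char to tolower is undefined behaviour.
        switch( tolower((unsigned char)*p) )
        {
            case 'w':
                flags |= O_CREAT | O_TRUNC;
                pmode |= _S_IWRITE;
                break;
            case 'a':
                flags |= O_CREAT | O_APPEND;
                pmode |= _S_IREAD | _S_IWRITE;
                break;
            case 'r':
                pmode |= _S_IREAD;
                break;
            case 'b':
                flags |= O_BINARY;
                break;
            case '+':
                flags |= O_RDWR;
                pmode |= _S_IREAD | _S_IWRITE;
                break;
        }
    }

    // Writing or appending without '+' means write-only access.
    if( (flags & O_CREAT) != 0 && (flags & O_RDWR) == 0 )
        flags |= O_WRONLY;
} //GetFlags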


gzFile wgzopen(const wchar_t* fileName, const char* mode)
{
    gzFile gzstream = NULL;
    int f = 0;
    int flags = 0;
    int pmode = 0;

    GetFlags(mode, flags, pmode);

    f = _wopen(fileName, flags, pmode );

    if( f == -1 )
        return NULL;

    // On success, gzclose() on the returned stream will also close the file descriptor.
    gzstream = gzdopen(f, mode);
    if(!gzstream)
        _close(f);
    return gzstream;
}