First of all, what is a filename?
On Unix-like systems
A filename is a sequence of bytes terminated by a zero byte. The kernel doesn't need to care about character encoding (except to know the ASCII code for /).
However, it's more convenient from the users' point of view to interpret filenames as sequences of characters, and this is done by a character encoding specified as part of the locale. Unicode is supported by making UTF-8 locales available.
In C programs, filenames are passed as ordinary char* strings to functions like fopen. There is no wide-character version of the POSIX API. If you have a wchar_t* filename, you must explicitly convert it to a char*.
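As a rough sketch of that conversion, the helper below turns a wchar_t* filename into a multibyte string in the current locale's encoding and hands it to fopen. The name fopen_wide is made up for illustration; it assumes the program has already called setlocale(LC_ALL, "") and relies on the POSIX behavior of wcstombs returning the required length when the destination is NULL.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper (not part of POSIX): open a file whose name is
   given as a wide string. Returns NULL if the name cannot be
   represented in the current locale or the file cannot be opened. */
FILE *fopen_wide(const wchar_t *wname, const char *mode)
{
    /* First pass: ask how many bytes the converted name needs
       (POSIX: a NULL destination returns the required length,
       excluding the terminating zero byte). */
    size_t len = wcstombs(NULL, wname, 0);
    if (len == (size_t)-1)
        return NULL;                 /* not representable in this locale */

    char *name = malloc(len + 1);
    if (name == NULL)
        return NULL;

    /* Second pass: do the actual conversion, including the terminator. */
    wcstombs(name, wname, len + 1);

    FILE *fp = fopen(name, mode);
    free(name);
    return fp;
}
```

A call such as fopen_wide(L"notes.txt", "r") then behaves like fopen("notes.txt", "r"); in a UTF-8 locale the same helper should also handle names outside ASCII.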
On Windows NT
A filename is a sequence of UTF-16 code units. In fact, virtually all string manipulation in Windows is done in UTF-16 internally.
All of Microsoft's C(++) libraries, including the Visual C++ runtime library, use the convention that char* strings are in the locale-specific legacy "ANSI" code page, and wchar_t* strings are in UTF-16. And the char* functions are just backwards-compatibility wrappers around the newer wchar_t* functions.
So, if you call MessageBoxA(hwnd, text, caption, type), that's essentially the same as calling MessageBoxW(hwnd, ToUTF16(text), ToUTF16(caption), type). And when you call fopen(filename, mode), that's like _wfopen(ToUTF16(filename), ToUTF16(mode)).
Note that _wfopen is one of many non-standard C functions for working with wchar_t* strings. And this isn't just for convenience; you can't use the standard char* equivalents because they limit you to the "ANSI" code page (which, at least before Windows 10 version 1903, couldn't be set to UTF-8). For example, in a windows-1252 locale, you can't (easily) fopen the file שלום.c, because there's just no way to represent those characters in a narrow string.
In cross-platform libraries
Some typical approaches are:
1. Use Standard C functions with char* strings, and just don't give a 💩 about support for non-ANSI characters on Windows.
2. Use char* strings but interpret them as UTF-8 instead of ANSI. On Windows, write wrapper functions that take UTF-8 arguments, convert them to UTF-16, and call functions like _wfopen.
3. Use wide character strings everywhere, which is like #2 except that you need to write wrapper functions for non-Windows systems.
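A minimal sketch of the second approach might look like the following. The names fopen_utf8 and utf8_to_utf16 are invented for this example; on Windows it assumes the Win32 MultiByteToWideChar function and the CRT's _wfopen, and on other systems it assumes that the platform already accepts UTF-8 bytes and simply forwards to fopen.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a zero-terminated UTF-8 string into a
   newly allocated UTF-16 string. Returns NULL on invalid input or
   allocation failure; the caller frees the result. */
static wchar_t *utf8_to_utf16(const char *s)
{
    /* With a length of -1 the returned count includes the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0)
        return NULL;

    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) == 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: like fopen, but the filename is always
   interpreted as UTF-8, regardless of the "ANSI" code page. */
FILE *fopen_utf8(const char *name, const char *mode)
{
#ifdef _WIN32
    wchar_t *wname = utf8_to_utf16(name);
    wchar_t *wmode = utf8_to_utf16(mode);
    FILE *fp = NULL;

    if (wname != NULL && wmode != NULL)
        fp = _wfopen(wname, wmode);

    free(wname);   /* free(NULL) is a no-op */
    free(wmode);
    return fp;
#else
    /* Filenames are plain bytes here, so UTF-8 passes straight through. */
    return fopen(name, mode);
#endif
}
```

Application code would then call fopen_utf8 everywhere instead of fopen, keeping the platform difference confined to one function.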
How does zlib handle filenames?
Unfortunately, it appears to use the naïve approach #1 above, with open (rather than _wopen) used directly.
How can you work around it?
Besides the solutions already mentioned (my favorite of which is Appleman1234's gzdopen suggestion), you could take advantage of symbolic links to give the file an alternative all-ASCII name which you could then safely pass to gzopen. You might not even have to do that if the file already has a suitable short (8.3) name.