5
votes

I have been confused about this for quite a while:

Confusion on Unicode and Multibyte Articles

After reading the comments by all contributors, plus:

Looking at an old article (year 2001): http://www.hastingsresearch.com/net/04-unicode-limitations.shtml, which talks about Unicode:

being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to over 170,000 characters.

and looking at a current "modern" article: http://en.wikipedia.org/wiki/Unicode

The most commonly used encodings are UTF-8 (which uses 1 byte for all ASCII characters, which have the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).

It seems that in the compilation options in VC2008, the option "Unicode" under Character Set really means "Unicode encoded in UCS-2" (or UTF-16? I am not sure).

I tried to verify this by running the following code under VC2008:

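#include <cstdio>    // getchar
#include <iostream>

int main()
{
    // Wide literal: 3 characters + terminating L'\0', each one wchar_t.
    // Use unicode encoded in UCS-2?
    std::cout << sizeof(L"我爱你") << std::endl;
    // Wide literal again: 3 characters + terminating L'\0'.
    // Use unicode encoded in UCS-2?
    std::cout << sizeof(L"abc") << std::endl;
    getchar();    // wait for a key press so the console stays open

    // Compiled using options Character Set : Use Unicode Character Set.
    // prints 8, 8 (4 wchar_t of 2 bytes each under VC2008)

    // Compiled using options Character Set : Multi-byte Character Set.
    // prints 8, 8 (the L prefix, not the project option, decides the type)
}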

It seems that when compiling with the Unicode Character Set option, the outcome matches my assumption.

But what about Multi-byte Character Set? What does Multi-byte Character Set mean in the current "modern" world? :)

5
MBCS means nothing. Today we have Unicode. All you knew before is gone (mostly). - John Saunders
The L prefix (it is a literal prefix, not a macro) causes the compiler to treat both strings as wide character strings, hence the result of (8, 8) you obtained makes sense. Removing the L will give a result of (7, 4), as per Microsoft standard /shrug - YeenFei
@Potatoswatter: What are you talking about? A string literal has array type, in this case wchar_t const[4]. When you dereference that, the array first decays to a wchar_t const*. Dereferencing that in turn gives you a wchar_t const. Thus, *L"123456789" == L'1' and sizeof(*L"123456789")==sizeof(L'1') - MSalters
@MSalters: you're right; it was coincidence that his strings are a power of 2 size. Corrected in my answer. - Potatoswatter

5 Answers

6
votes

http://en.wikipedia.org/wiki/Multi-byte_character_set

MBCS is a term used to denote a class of character encodings with characters that cannot be represented with a single byte, hence multi-byte character set. In order to properly decode a string in this format, you need a codepage that tells you how the various byte combinations map to characters. As an indication of where legacy code pages stand: ISO/IEC 8859 defines a set of code page standards (strictly speaking single-byte 8-bit ones rather than MBCS), and according to Wikipedia, ISO stopped maintaining them in 2004, presumably to focus on Unicode.

So I guess the modern term for MBCS is "deprecated in favor of Unicode".

0
votes

Multi-byte means that one character may be stored in more than one byte.

extract from wikipedia on utf8:

UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters.

so essentially, utf8 is a multi-byte character set :-).
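To make the 1 to 4 octet rule concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is made up for this illustration and is not part of any standard library; it assumes the caller passes a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Precondition: cp is a scalar value (at most 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                 // 1 byte: the 128 US-ASCII characters
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: rest of the Basic Multilingual Plane
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: supplementary planes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, encode_utf8(0x61) ('a') yields one byte, while encode_utf8(0x6211) (the character 我 from the question) yields the three bytes E6 88 91.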

0
votes

Multi Byte Character Set is a general term for any encoding scheme that can use more than 1 byte to encode a character.

When you hear the term you would normally expect it to be referring to one of the older legacy character sets, as in "IBM EBCDIC cp1390 - Japanese Kanji Multi Byte".

All the UNICODE schemes are technically MBCSs, but you would expect them to be referred to as "UNICODE" collectively or utf-8, utf-16, or utf-32 specifically.

The only "current" software which uses an MBCS character set is the Microsoft Office suite, which uses the "Windows MBCS". This is almost identical to utf-16 apart from some minor differences. Due to Microsoft's early adoption of the draft standard, some small pieces of the complete standard proved difficult to implement, so it stuck with the term "Windows MBCS".

0
votes

In MSVC, the option "Unicode" under Character Set means that _T("X") expands to L"X". If set to MBCS, _T("X") expands to just "X".

Another consequence is whether the Win32 macro MessageBox() expands to MessageBoxW() or MessageBoxA(), as well as macros for all other Win32 functions that come in A/W pairs.
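The switching mechanism can be imitated without any Windows headers. The sketch below is only an imitation: DEMO_UNICODE, DEMO_T, demo_tchar, DemoMessage, DemoMessageA and DemoMessageW are hypothetical names invented here to stand in for the real project setting, _T, TCHAR and a Win32 A/W function pair.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins, not the real <tchar.h> / <windows.h> definitions.
// Compile with -DDEMO_UNICODE (or /DDEMO_UNICODE) to mimic the "Unicode" setting.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x               // DEMO_T("X") becomes L"X"
#define DemoMessage DemoMessageW     // generic name resolves to the wide version
#else
typedef char demo_tchar;
#define DEMO_T(x) x                  // DEMO_T("X") stays "X"
#define DemoMessage DemoMessageA     // generic name resolves to the narrow version
#endif

// An A/W pair: same job, different character type.
// Each returns its own name so the caller can see which one was selected.
std::string DemoMessageA(const char* text)
{
    assert(text != nullptr);
    return "DemoMessageA";
}

std::string DemoMessageW(const wchar_t* text)
{
    assert(text != nullptr);
    return "DemoMessageW";
}
```

Calling DemoMessage(DEMO_T("X")) compiles in both configurations, and the preprocessor alone decides whether the narrow or the wide function receives the literal. Note that a literal written with an explicit L prefix is wide regardless of the setting, which is why the question's program prints the same sizes in both modes.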

0
votes

It seems that in the compilation options in VC2008, the option "Unicode" under Character Set really means "Unicode encoded in UCS-2" (or UTF-16? I am not sure).

It uses Unicode encoded in UTF-16 LE. The Wikipedia article I link to has a note to that effect.
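To illustrate what UTF-16 LE means in practice, and where it differs from UCS-2, here is a small sketch. The helper names encode_utf16 and to_utf16le_bytes are invented for this example; it assumes valid code points as input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Code points below 0x10000 take one 16-bit unit (identical to UCS-2);
// anything above needs a surrogate pair, which UCS-2 cannot express.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),     // high surrogate
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) }; // low surrogate
}

// Hypothetical helper: serialize code units in little-endian byte order,
// i.e. low byte first, which is the "LE" in UTF-16 LE.
std::vector<std::uint8_t> to_utf16le_bytes(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return bytes;
}
```

For 我 (U+6211) this gives a single unit 0x6211, stored as the bytes 11 62. For a character outside the Basic Multilingual Plane such as U+1F600 it gives the pair D83D DE00, so a wchar_t string on Windows may hold more code units than characters.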

But what about Multi-byte Character Set? What does Multi-byte Character Set mean in the current "modern" world? :)

MBCS is primarily used in the MSDN documentation to mean DBCS. This is explained in more detail in this blog post. If you want to avoid confusion you can say "MBCS Code Page".
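As a rough illustration of how a DBCS code page works, the sketch below counts characters in a byte string that is assumed to be encoded in GBK (Windows code page 936). The function names are made up, and the lead-byte rule is a simplification: any byte in 0x81..0xFE is treated as the first half of a two-byte character, everything else as a single-byte character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified assumption for GBK / code page 936: bytes 0x81..0xFE start a
// two-byte character; all other bytes stand alone.
bool is_gbk_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

// Hypothetical helper: count characters (not bytes) in a GBK-encoded string.
// Precondition: the input is well formed, so a lead byte never ends the string.
std::size_t count_dbcs_chars(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const bool lead = is_gbk_lead_byte(static_cast<unsigned char>(bytes[i]));
        assert(!lead || i + 1 < bytes.size());
        i += lead ? 2 : 1;   // skip the trail byte together with its lead byte
        ++count;
    }
    return count;
}
```

The string 我爱你 is the six bytes CE D2 B0 AE C4 E3 in GBK, so count_dbcs_chars("\xCE\xD2\xB0\xAE\xC4\xE3") reports 3 characters. The same literal plus its terminating null byte occupies 7 bytes, which is consistent with the (7, 4) result mentioned in the comments for the non-L literals, assuming the compiler used such a code page.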