
As far as I understand, a coded character set maps/assigns numbers (called code points) to (abstract) characters (e.g. the German character ü is assigned the code point U+00FC in Unicode).

This code point can then be encoded (i.e. represented as a byte pattern) in different ways:

UTF-8 (1-byte code units), UTF-16 (2-byte code units) and UTF-32 (4-byte code units)

So the process is:

(abstract) characters --> mapped to code points by the coded character set --> code points encoded to 1...n bytes
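
A minimal Python sketch of those two stages (the byte values shown in the comments are just what the standard encodings produce for ü):

    # Stage 1: abstract character -> code point (the coded character set)
    ch = "ü"
    code_point = ord(ch)              # 252, i.e. U+00FC
    print(f"U+{code_point:04X}")      # U+00FC

    # Stage 2: code point -> bytes (the encoding)
    print(ch.encode("utf-8"))         # b'\xc3\xbc'             (2 bytes)
    print(ch.encode("utf-16-be"))     # b'\x00\xfc'             (2 bytes)
    print(ch.encode("utf-32-be"))     # b'\x00\x00\x00\xfc'     (4 bytes)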

Why this intermediate stage of code points? Why are (abstract) characters not directly mapped to 1...n bytes?

(abstract) characters --> mapped directly to 1...n bytes

This intermediate stage (assigning numbers to characters) is also used in other (coded) character sets, so there must be good reasons for it.

I want to understand why no direct mapping to bytes is done, and whether there are character sets which don't have this intermediate stage and map directly to bytes.

Thanks in advance...


1 Answer


Why are (abstract) characters not directly mapped to bytes?

To do that, we would need a single byte encoding scheme that everyone agreed was best for every possible scenario.

We are a very long way off that being true. UTF-8, -16 and -32—not to mention all the other legacy encodings that are never going away—all have different strengths and are used for different purposes by different communities.

With multiple byte encodings unavoidably in play, you need a unified coded character set behind them, so that each encoding can be mapped back and forth to that character set. The alternative is a combinatorial explosion of mapping tables between every possible pair of encodings.
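
As a rough Python sketch of that pivot role (the string and encodings here are just placeholders): converting between any two encodings goes through the shared code points, so each encoding only needs one mapping to and from Unicode, not one per other encoding.

    # Any source encoding -> Unicode code points -> any target encoding.
    data_utf16 = "Grüße".encode("utf-16-le")   # bytes in one encoding

    text = data_utf16.decode("utf-16-le")      # back to code points (a str)
    data_utf8 = text.encode("utf-8")           # out to another encoding

    print(data_utf8)                           # b'Gr\xc3\xbc\xc3\x9fe'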

(That is what we had before Unicode. The tables were incomplete, lossy and inconsistent. It was not a good time.)