I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).
As far as I know old ASCII characters took one byte per character.
Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses 7 of the 8 bits in a byte. Put another way, it covers half of the 256 values a byte can hold.
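A quick way to check those numbers yourself. This is just a sketch in Python 3; the variable names are mine, and it relies on `str.isprintable()` counting the space character as printable.

```python
# All 128 ASCII codes, 0 through 127 (everything that fits in 7 bits).
ascii_chars = [chr(n) for n in range(2 ** 7)]

# Printable ones: space (0x20) through tilde (0x7E).
printable = [c for c in ascii_chars if c.isprintable()]

# Each ASCII character is stored as exactly one byte, with the top bit unused.
single_byte = all(len(c.encode("ascii")) == 1 for c in ascii_chars)
top_bit_unused = all(c.encode("ascii")[0] < 0x80 for c in ascii_chars)

print(len(ascii_chars), len(printable), single_byte, top_bit_unused)
```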
How many bytes does a Unicode character require?
Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.
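To make the codepoint/bytes distinction concrete, here is a minimal Python 3 sketch (Python is my choice for illustration, and `bytes.hex(" ")` with a separator needs Python 3.8 or later). The codepoint is just a number; bytes only show up once you pick an encoding.

```python
char = "€"

# The codepoint is a plain integer; no bytes are involved yet.
code_point = ord(char)              # 8364
label = f"U+{code_point:04X}"       # "U+20AC"

# Bytes appear only after choosing an encoding.
utf8_bytes = char.encode("utf-8")        # 3 bytes
utf16_bytes = char.encode("utf-16-be")   # 2 bytes, big-endian, no BOM

print(label, utf8_bytes.hex(" "), utf16_bytes.hex(" "))
```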
I assume that one Unicode character can contain every possible
character from any language - am I correct?
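No. But almost. So basically yes. But still no. A single Unicode character is just one character; it is Unicode as a whole that aims to cover the characters of every language, and it covers most of them, though not yet all.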
So how many bytes does it need per character?
Same as your 2nd question.
And what do UTF-7, UTF-8, UTF-16 etc. mean? Are they some kind of Unicode
versions?
No, those are encodings. They define how bytes/octets should represent Unicode characters.
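As a rough illustration, the same character can take a different number of bytes depending on the encoding. The helper name `encoded_sizes` is made up for this sketch, and I use the big-endian codec names so that Python doesn't prepend a byte order mark.

```python
def encoded_sizes(char):
    """Return how many bytes one character takes in a few encodings."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(char.encode(name)) for name in encodings}

for char in ("a", "€", "💩"):
    print(f"U+{ord(char):04X}", encoded_sizes(char))
```

So the answer to "how many bytes per character" depends on both the character and the encoding: UTF-8 uses 1 to 4 bytes, UTF-16 uses 2 or 4, and UTF-32 always uses 4.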
A couple of examples (the UTF-16 bytes are shown in big-endian order). If some of them cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.
- U+0061 LATIN SMALL LETTER A:
a
- Nº: 97
- UTF-8: 61
- UTF-16: 00 61
- U+00A9 COPYRIGHT SIGN:
©
- Nº: 169
- UTF-8: C2 A9
- UTF-16: 00 A9
- U+00AE REGISTERED SIGN:
®
- Nº: 174
- UTF-8: C2 AE
- UTF-16: 00 AE
- U+1337 ETHIOPIC SYLLABLE PHWA:
ጷ
- Nº: 4919
- UTF-8: E1 8C B7
- UTF-16: 13 37
- U+2014 EM DASH:
—
- Nº: 8212
- UTF-8: E2 80 94
- UTF-16: 20 14
- U+2030 PER MILLE SIGN:
‰
- Nº: 8240
- UTF-8: E2 80 B0
- UTF-16: 20 30
- U+20AC EURO SIGN:
€
- Nº: 8364
- UTF-8: E2 82 AC
- UTF-16: 20 AC
- U+2122 TRADE MARK SIGN:
™
- Nº: 8482
- UTF-8: E2 84 A2
- UTF-16: 21 22
- U+2603 SNOWMAN:
☃
- Nº: 9731
- UTF-8: E2 98 83
- UTF-16: 26 03
- U+260E BLACK TELEPHONE:
☎
- Nº: 9742
- UTF-8: E2 98 8E
- UTF-16: 26 0E
- U+2614 UMBRELLA WITH RAIN DROPS:
☔
- Nº: 9748
- UTF-8: E2 98 94
- UTF-16: 26 14
- U+263A WHITE SMILING FACE:
☺
- Nº: 9786
- UTF-8: E2 98 BA
- UTF-16: 26 3A
- U+2691 BLACK FLAG:
⚑
- Nº: 9873
- UTF-8: E2 9A 91
- UTF-16: 26 91
- U+269B ATOM SYMBOL:
⚛
- Nº: 9883
- UTF-8: E2 9A 9B
- UTF-16: 26 9B
- U+2708 AIRPLANE:
✈
- Nº: 9992
- UTF-8: E2 9C 88
- UTF-16: 27 08
- U+271E SHADOWED WHITE LATIN CROSS:
✞
- Nº: 10014
- UTF-8: E2 9C 9E
- UTF-16: 27 1E
- U+3020 POSTAL MARK FACE:
〠
- Nº: 12320
- UTF-8: E3 80 A0
- UTF-16: 30 20
- U+8089 CJK UNIFIED IDEOGRAPH-8089:
肉
- Nº: 32905
- UTF-8: E8 82 89
- UTF-16: 80 89
- U+1F4A9 PILE OF POO:
💩
- Nº: 128169
- UTF-8: F0 9F 92 A9
- UTF-16: D8 3D DC A9
- U+1F680 ROCKET:
🚀
- Nº: 128640
- UTF-8: F0 9F 9A 80
- UTF-16: D8 3D DE 80
Okay I'm getting carried away...
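If you want to reproduce the entries above, something like this Python 3 sketch should do it. The function names `describe` and `surrogate_pair` are my own, `bytes.hex(" ")` needs Python 3.8 or later, and the name lookup depends on the Unicode database shipped with your Python version. The second function shows where the two UTF-16 units for codepoints above U+FFFF come from.

```python
import unicodedata

def describe(char):
    """Collect the same fields as the list above for one character."""
    cp = ord(char)
    return {
        "code_point": f"U+{cp:04X}",
        "name": unicodedata.name(char),
        "number": cp,
        "utf8": char.encode("utf-8").hex(" ").upper(),
        "utf16": char.encode("utf-16-be").hex(" ").upper(),
    }

def surrogate_pair(cp):
    """Split a codepoint above U+FFFF into UTF-16 high and low surrogates."""
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

print(describe("a"))
print(describe("💩"))
print([f"{unit:04X}" for unit in surrogate_pair(0x1F4A9)])
```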
Fun facts: