UTF8 versus Latin1

Question

I am trying to understand the difference between Latin1 and UTF8 and for the most part I get it, however, when testing I am getting some weird results and could use some help clarifying

I am testing with 'é' (Latin small letter E with acute) and the link below shows the hex c3a9

I setup a database and specified utf8 as the character set, then created a table with utf8 as the character set and inserted a record with the character 'é' after setting the connection and client character set to UTF8

when I do a select hex(field), field from test_table I get:

hex(field), field C3A9, é

This is fine and consistent with what I read, however, when I do the exact same using a latin1 character set I get the following:

hex(field), field C3A9, é

but if I enter char(x'E9') which should be the single byte Latin1 equivalent value for é I manage to get it to display correctly using 'set names UTF8' but it doesn't show up correctly when setting the connection and client to Latin1

Can anyone clarify? - shouldn't Latin1 characters be single byte (Hex E9) in both UTF8 and Latin1? or am I completely misunderstanding it all?

Thanks

Could you post a SHOW CREATE TABLE so we can see the charset of the table. — redreddington
Also how are you viewing these results? If I set my terminal to ISO-8859-1 when I do select hex("é"); I get E9. When it's set to UTF-8 I get C3A9. — redreddington
mysql> show create table test_latin\G *************************** 1. row *************************** Table: test_latin Create Table: CREATE TABLE test_latin ( a varchar(500) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 — digital_paki
I had been using Mysql Workbench, but I tried using a terminal and switching between latin1 and utf8 and see what you report as well — digital_paki

Rick James Rick James · Accepted Answer · 2015-12-10T02:55:28

latin1 encoding has only 1-byte codes.

The first 128 codes (7-bits) are mostly identical between latin1 and utf8.

é is beyond the 128; it's 1-byte, 8-bit latin1 hex is E9 (as you observed). For utf8, it takes 2 bytes: C3A9. For most Asian characters, utf8 takes 3 byte; latin1 cannot represent those characters.

MySQL has the confusing command SET NAMES utf8. That announces that the client's encoding is utf8, and instructs the communication between client and server to convert between the column's CHARACTER SET and utf8 when reading/writing.

If you have SET NAMES latin1 (the old default), but the bytes in the client are encoded utf8, then you are 'lying', and various nasty things happen. But there is no immediate clue that something is wrong.

Checklist for going entirely utf8:

Bytes in client are utf8-encoded
SET NAMES utf8 (or equivalent parameter during connecting to MySQL)
CHARACTER SET utf8 on column or table declaration
<meta ... UTF-8> in html

UTF8 versus Latin1

2 Answers