5
votes

I am trying to understand the difference between Latin1 and UTF8 and for the most part I get it, however, when testing I am getting some weird results and could use some help clarifying

I am testing with 'é' (Latin small letter E with acute) and the link below shows the hex c3a9

I setup a database and specified utf8 as the character set, then created a table with utf8 as the character set and inserted a record with the character 'é' after setting the connection and client character set to UTF8

when I do a select hex(field), field from test_table I get:

hex(field), field C3A9, é

This is fine and consistent with what I read, however, when I do the exact same using a latin1 character set I get the following:

hex(field), field C3A9, é

but if I enter char(x'E9') which should be the single byte Latin1 equivalent value for é I manage to get it to display correctly using 'set names UTF8' but it doesn't show up correctly when setting the connection and client to Latin1

Can anyone clarify? - shouldn't Latin1 characters be single byte (Hex E9) in both UTF8 and Latin1? or am I completely misunderstanding it all?

Thanks

2
Could you post a SHOW CREATE TABLE so we can see the charset of the table.redreddington
Also how are you viewing these results? If I set my terminal to ISO-8859-1 when I do select hex("é"); I get E9. When it's set to UTF-8 I get C3A9.redreddington
mysql> show create table test_latin\G *************************** 1. row *************************** Table: test_latin Create Table: CREATE TABLE test_latin ( a varchar(500) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1digital_paki
I had been using Mysql Workbench, but I tried using a terminal and switching between latin1 and utf8 and see what you report as welldigital_paki

2 Answers

3
votes

latin1 encoding has only 1-byte codes.

The first 128 codes (7-bits) are mostly identical between latin1 and utf8.

é is beyond the 128; it's 1-byte, 8-bit latin1 hex is E9 (as you observed). For utf8, it takes 2 bytes: C3A9. For most Asian characters, utf8 takes 3 byte; latin1 cannot represent those characters.

MySQL has the confusing command SET NAMES utf8. That announces that the client's encoding is utf8, and instructs the communication between client and server to convert between the column's CHARACTER SET and utf8 when reading/writing.

If you have SET NAMES latin1 (the old default), but the bytes in the client are encoded utf8, then you are 'lying', and various nasty things happen. But there is no immediate clue that something is wrong.

Checklist for going entirely utf8:

  • Bytes in client are utf8-encoded
  • SET NAMES utf8 (or equivalent parameter during connecting to MySQL)
  • CHARACTER SET utf8 on column or table declaration
  • <meta ... UTF-8> in html
2
votes

After recently putting a website through the ringer UTF-8 wise I think this is a case of viewing UTF-8 data in a latin1 table within a UTF-8 encoded page or terminal.

If you are using a terminal you can check this by looking at the character encoding setting of the terminal (in Ubuntu it's Terminal -> Set Character Encoding). If you are using something like PHPMyAdmin, view the page source and look for the charset of the page, or open up Firebug and look at the response headers for the page, it should say "UTF-8".

If you've inserted the data and it's encoded in UTF-8 and it goes into a latin1 table then the data will still be stored in UTF-8, it's only when you start viewing that data or retrieving that data in a different encoding that you start getting the mangled effect.

I've found it's really crucial that when you are working with character encoding that you get everything the same: the page must have a charset of UTF-8, the upstream into the database must be in UTF-8, the database must have a default charset and storage of UTF-8. As soon as you put a different charset in the mix everything goes crazy.