Determine character encoding in Ruby 1.9.3

Question

My Rails 3.2.2 / Ruby 1.9.3 application gets search requests such as:

http://booko.com.au/books/search?q=Fran%E7ois+Vergniolle+de+Chantal

Ruby / Rails decodes this query but assumes the result is UTF-8. At some point I get:

invalid byte sequence in UTF-8
app/models/product.rb:694:in `upcase'

I think it's doing something like this:

q="Fran%E7ois+Vergniolle+de+Chantal"
=> "Fran%E7ois+Vergniolle+de+Chantal"

CGI.unescape( q )
=> "Fran\xE7ois Vergniolle de Chantal"

CGI.unescape( q ).encoding.name
=> "UTF-8"

CGI.unescape( q ).valid_encoding?
=> false

What is the correct way of dealing with this? I'd like to transcode it to the correct encoding, but how do I determine the current encoding? What I'm currently doing is just assuming it's LATIN1:

q.encode!("ISO-8859-1", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Or doing something I found on a blog somewhere:

q = q.unpack('C*').pack('U*')

What's the right way of dealing with this?

Edit: The server is correctly sending the "Content-Type: text/html; charset=utf-8" header to the client. The page also contains the appropriate meta tag: 'meta http-equiv="content-type" content="text/html;charset=UTF-8"'

Not sure if there's another method to tell the client which encodings to use?

What if you write # coding: UTF-8 at the top of app/models/product.rb? I think that should solve the error. Would you be satisfied with this solution? — ck3g
You would have to use some kind of dictionary in order to determine the correct encoding, as the same byte 0xE7 could be (and indeed is) a valid character in encodings other than Latin1. — Mladen Jablanović
@ck3g The data is coming from a web request so that won't help. The app already thinks it's UTF-8 when it isn't. — dkam
@MladenJablanović Yes - that would be a solution. Does such a dictionary exist? As 0xE7 exists in multiple encodings, you'd want to sort by most common I guess - unless there were multiple characters to narrow down the choice. — dkam

dkam · Accepted Answer · 2012-03-22T09:35:00

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes: C3 A7.
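The byte-level difference can be checked directly in Ruby. This is a minimal sketch, assuming Ruby 1.9 or later; `hex_bytes` is a helper name made up for this illustration, not part of Ruby:

```ruby
# Hypothetical helper (not part of Ruby): transcode a string to the given
# encoding and return its bytes as space-separated uppercase hex.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02X", b) }.join(" ")
end

# U+00E7, written as an escape so the source file itself stays ASCII.
c_cedilla = "\u00E7"

puts hex_bytes(c_cedilla, "ISO-8859-1")  # E7
puts hex_bytes(c_cedilla, "UTF-8")       # C3 A7
puts hex_bytes(c_cedilla, "UTF-16BE")    # 00 E7
puts hex_bytes(c_cedilla, "UTF-32BE")    # 00 00 00 E7
```

The same character comes out as one, two, two or four bytes depending on the encoding, which is why a lone E7 byte cannot be valid UTF-8.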

See here for other ways to encode ç.

As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
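That overlap can be verified by decoding every possible byte as ISO-8859-1 and comparing the resulting code point with the byte value. A small sketch, assuming Ruby's built-in ISO-8859-1 transcoder; the method name `latin1_matches_unicode?` is made up for this example:

```ruby
# Hypothetical check (name invented for illustration): for each byte 0..255,
# build a one-character ISO-8859-1 string, transcode it to UTF-8, and compare
# the resulting Unicode code point with the original byte value.
def latin1_matches_unicode?
  (0..255).all? do |byte|
    byte.chr("ISO-8859-1").encode("UTF-8").ord == byte
  end
end

puts latin1_matches_unicode?
```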

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL. That means the best approach is probably to check that the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

unless query.valid_encoding?
    query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end

Here's the process in IRB (plus escaping the result at the end, for fun):

a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1")    # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
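The check-then-transcode approach can be wrapped in a small non-mutating helper. This is only a sketch: `to_utf8` and its `fallback` parameter are names invented here, and ISO-8859-1 remains a guess about the client, not a detected encoding. It assumes the input is tagged as UTF-8, as the output of CGI.unescape was in the session above.

```ruby
# Hypothetical helper (not a Rails or Ruby API): return the string unchanged
# when its bytes are valid in its current encoding; otherwise reinterpret the
# bytes as `fallback` and transcode them to UTF-8.
# The fallback of ISO-8859-1 is an assumption, not a detection.
def to_utf8(str, fallback = "ISO-8859-1")
  return str if str.valid_encoding?
  str.encode("UTF-8", fallback,
             :invalid => :replace, :undef => :replace, :replace => "")
end

# The same bytes the unescaped query contained: a lone E7 byte tagged as UTF-8.
raw = "Fran\xE7ois".dup.force_encoding("UTF-8")

fixed = to_utf8(raw)
puts raw.valid_encoding?     # false
puts fixed.valid_encoding?   # true
puts fixed == "Fran\u00E7ois" # true
```

Returning a new string instead of calling encode! leaves the original bytes available, which can help if a different fallback encoding has to be tried later.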
