7
votes

My Rails 3.2.2 / Ruby 1.9.3 application gets search requests such as:

http://booko.com.au/books/search?q=Fran%E7ois+Vergniolle+de+Chantal

Ruby / Rails takes this query and decodes it, but assumes it's UTF-8. At some point I get:

invalid byte sequence in UTF-8
app/models/product.rb:694:in `upcase' 

I think it's doing something like this:

q="Fran%E7ois+Vergniolle+de+Chantal"
=> "Fran%E7ois+Vergniolle+de+Chantal"

CGI.unescape( q )
=> "Fran\xE7ois Vergniolle de Chantal"

CGI.unescape( q ).encoding.name
=> "UTF-8"

CGI.unescape( q ).valid_encoding?
=> false

What is the correct way of dealing with this? I'd like to transcode it to the correct encoding, but how do I determine the current encoding? What I'm currently doing is just assuming it's LATIN1:

q.encode!("ISO-8859-1", "UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

Or doing something I found on a blog somewhere:

q = q.unpack('C*').pack('U*')

What's the right way of dealing with this?

Edit: The server is correctly sending the "Content-Type: text/html; charset=utf-8" header to the client. The page also contains the appropriate meta tag: 'meta http-equiv="content-type" content="text/html;charset=UTF-8"'

Not sure if there's another method to tell the client which encodings to use?

2
What if you write # coding: UTF-8 at the top of app/models/product.rb? I think that should solve the error. Would you be satisfied with this solution? – ck3g
@ck3g, nope, it's not about file encoding here. – fl00r
You would have to use some kind of dictionary in order to determine the correct encoding, as the same byte 0xE7 could be (and indeed is) a valid character in encodings other than Latin1. – Mladen Jablanović
@ck3g The data is coming from a web request so that won't help. The app already thinks it's UTF-8 when it isn't. – dkam
@MladenJablanović Yes, that would be a solution. Does such a dictionary exist? As 0xE7 exists in multiple encodings, you'd want to sort by most common I guess, unless there were multiple characters to narrow down the choice. – dkam
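
To illustrate the point made in the comments, here is a small sketch showing that the single byte 0xE7 decodes to a different character depending on which single-byte encoding you assume. The helper name decode_byte and the three encodings chosen are only for illustration; they are not a detection scheme.

```ruby
# Hypothetical helper: interpret +byte+ as text in the encoding +name+
# and return it transcoded to UTF-8.
def decode_byte(byte, name)
  byte.dup.force_encoding(name).encode("UTF-8")
end

byte = "\xE7"

%w[ISO-8859-1 Windows-1251 ISO-8859-7].each do |name|
  char = decode_byte(byte, name)
  # Print the encoding name, the resulting code point and the character.
  printf("%-12s U+%04X %s\n", name, char.ord, char)
end
```

The same byte comes out as a Latin, a Cyrillic and a Greek letter respectively, which is why a lone byte cannot tell you the encoding; only context (or a guess at the most likely source) can.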

2 Answers

5
votes

The character ç is encoded in the URL as %E7. This is how ISO-8859-1 encodes ç. The ISO-8859-1 character set represents a character with a single byte. The byte which represents ç can be expressed in hex as E7.

In Unicode, ç has a code point of U+00E7. Unlike ISO-8859-1, in which the code point (E7) is the same as its encoding (E7 in hex), Unicode has multiple encoding schemes such as UTF-8, UTF-16 and UTF-32. UTF-8 encodes U+00E7 (ç) as two bytes: C3 A7.

See here for other ways to encode ç.

As to why U+00E7 and E7 in ISO-8859-1 both use "E7", the first 256 code points in Unicode were made identical to ISO-8859-1.
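
The difference between a code point and its encoded bytes can be checked directly in Ruby. This is just a quick sketch; the variable name c is arbitrary.

```ruby
c = "\u00E7"                            # ç, held as a UTF-8 string

p c.unpack("U*")                        # Unicode code point   => [231]  (0xE7)
p c.bytes.to_a                          # UTF-8 bytes          => [195, 167]  (C3 A7)
p c.encode("ISO-8859-1").bytes.to_a     # ISO-8859-1 byte      => [231]  (E7)
p c.encode("UTF-16BE").bytes.to_a       # UTF-16 (big endian)  => [0, 231]
```

One code point, three different byte sequences, depending on the encoding scheme.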

If this URL were UTF-8, ç would be encoded as %C3%A7. My (very limited) understanding of RFC2616 is that the default encoding for a URL is (currently) ISO-8859-1. Therefore, this is most likely an ISO-8859-1 encoded URL. That means the best approach is probably to check that the encoding is valid and, if not, assume it is ISO-8859-1 and transcode it to UTF-8:

unless query.valid_encoding?
    query.encode!("UTF-8", "ISO-8859-1", :invalid => :replace, :undef => :replace, :replace => "")
end
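
The same idea can be wrapped in a small non-mutating helper so it can be applied to any incoming parameter before calling methods like upcase. This is a sketch only: the name to_utf8 is made up here, and the ISO-8859-1 fallback is the assumption discussed above, not something the code detects.

```ruby
# Hypothetical helper (not part of Rails or Ruby).
# Returns +str+ unchanged if it is already valid in its current encoding;
# otherwise ASSUMES the bytes are ISO-8859-1 and transcodes them to UTF-8.
def to_utf8(str)
  return str if str.valid_encoding?
  str.encode("UTF-8", "ISO-8859-1")
end

# The value CGI.unescape returned in the question: tagged UTF-8, but invalid.
raw = "Fran\xE7ois Vergniolle de Chantal"

q = to_utf8(raw)
puts q.valid_encoding?   # true
puts q.upcase            # no longer raises "invalid byte sequence in UTF-8"
```

Because every byte value is defined in Ruby's ISO-8859-1, the fallback never raises, but it will produce the wrong characters if the client actually used some other single-byte encoding.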

Here's the process in IRB (plus re-escaping the result at the end for fun):

a = CGI.unescape("%E7")
=> "\xE7"
a.encoding
=> #<Encoding:UTF-8>
a.valid_encoding?
=> false
b = a.encode("UTF-8", "ISO-8859-1")    # From ISO-8859-1 -> UTF-8
=> "ç"
b.encoding
=> #<Encoding:UTF-8>
CGI.escape(b)
=> "%C3%A7"
0
votes

It seems like it is a URL-encoded string. For reference, here is a list of encoded characters: http://www.degraeve.com/reference/urlencoding.php

Unfortunately the CGI library has problems with UTF-8: the unescape method works well with some characters, such as the space, but not with others.

require 'cgi'
a = "Fran%E7ois+Vergniolle+de+Chantal"
a = a.gsub('+', ' ').gsub('%E7', 'ç')
puts a
=> François Vergniolle de Chantal

a = "Fran%E7ois+Vergniolle+de+Chantal"
a = CGI::unescape(a) 
puts a
=> Franis Vergniolle de Chantal

Maybe you can implement your own method using gsub and the list of encoded characters?
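
If you do go the hand-rolled route, a generic version that decodes every %XX escape is probably easier to maintain than a list of individual characters. The sketch below is one way that might look; unescape_latin1 is a made-up name, and it simply assumes the escaped bytes are ISO-8859-1 rather than detecting the encoding.

```ruby
# Hypothetical helper: percent-decode +str+ byte by byte, then transcode.
# ASSUMPTION: the escaped bytes are ISO-8859-1.
def unescape_latin1(str)
  # Work on raw bytes so no encoding checks get in the way.
  bytes = str.dup.force_encoding("BINARY").tr("+", " ")
  # Replace each %XX escape with the single byte it stands for.
  bytes = bytes.gsub(/%([0-9A-Fa-f]{2})/) { [$1.hex].pack("C") }
  # Label the bytes as ISO-8859-1, then convert to UTF-8.
  bytes.force_encoding("ISO-8859-1").encode("UTF-8")
end

puts unescape_latin1("Fran%E7ois+Vergniolle+de+Chantal")
```

Note that a query which really was UTF-8 (for example %C3%A7) would be mangled by this, so it only makes sense when you already know the client sent ISO-8859-1.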