Convert non-ASCII chars from ASCII-8BIT to UTF-8

Question

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.\n

That Copyright code expanded looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n

Ruby tells me that the string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8

I can strip the copyright code out using this regex:

str.gsub(/[^\x00-\x7F]/n,'?')

to produce this:

Cancer Res; 71(3); 1-11. ??2011 AACR.\n

But how can I get a copyright symbol (and various other symbols such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding but this does not work:

str.force_encoding('utf-8').encode

I realize there are many other people with similar issues but I've yet to see a solution that works.

How are you pulling text from the remote sites? Scraping pages? Please show some sample code, including the HTTP client you are using, and whether you are parsing the pages using Nokogiri, Hpricot or ReXML. This problem could be a result of how you are retrieving the page, and/or how you are parsing the page. Once we know you're pulling the content in a data-safe manner, we can help you with converting the data between code sets. — the Tin Man
Real simple code - open-uri and nokogiri - e.g. doc = Nokogiri::XML(open(url)) then doc.css(...).text to pull out the relevant blocks of text — craic.com
Please show sample code. Is the file you are retrieving HTML or XML? Nokogiri does care about the difference when parsing. Also, provide some URLs, because every site on the internet is different. — the Tin Man
"I see references to using force_encoding but this does not work" What does "does not work" mean? Does it raise an error? Does Ruby segfault? Does your computer catch on fire? Does it replace the contents of the string with the lyrics to Yankee Doodle Dandy? Details, please! :) — Phrogz

Phrogz · Accepted Answer · 2011-02-02T14:45:31

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>
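
Note that force_encoding only relabels the bytes; it does not transcode them. That is the right tool here because the bytes in the question are already valid UTF-8. Calling encode('UTF-8') directly on an ASCII-8BIT string that contains bytes above 0x7F raises Encoding::UndefinedConversionError instead. If some of the remote pages might not be UTF-8 at all, a rough sketch along these lines may help. The helper name to_utf8 and the ISO-8859-1 fallback are my own assumptions, not something taken from the question, so substitute whatever encoding your sources really use:

```ruby
# Hypothetical helper (not part of the original answer): relabel binary bytes
# as UTF-8 when they already form valid UTF-8, otherwise transcode them from
# an assumed legacy encoding.
def to_utf8(bytes, fallback = Encoding::ISO_8859_1)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  # Assumption: anything that is not valid UTF-8 is in the fallback encoding.
  bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# The bytes from the question, tagged as binary the way the scraper returned them.
utf8_bytes = "\xC2\xA92011 AACR".dup.force_encoding('ASCII-8BIT')
# The same text as it would arrive from an ISO-8859-1 page (illustrative only).
latin1_bytes = "\xA92011 AACR".dup.force_encoding('ASCII-8BIT')

puts to_utf8(utf8_bytes)    # prints "©2011 AACR"
puts to_utf8(latin1_bytes)  # prints "©2011 AACR"
```

The valid_encoding? check is only a heuristic. Some byte sequences are valid in both encodings, so when the HTTP headers or the document itself declare a charset, that declaration is a better guide than guessing.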
