50
votes

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.

Here is an example of some offending text:

Cancer Res; 71(3); 1-11. ©2011 AACR.\n

With the copyright symbol shown as escaped bytes, it looks like this:

Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n

Ruby tells me that string is encoded as ASCII-8BIT and feeding into my Rails app gets me this:

incompatible character encodings: ASCII-8BIT and UTF-8

I can strip the copyright symbol out using this regex:

str.gsub(/[^\x00-\x7F]/n,'?')

to produce this:

Cancer Res; 71(3); 1-11. ??2011 AACR.\n

But how can I get a copyright symbol (and various other symbols, such as Greek letters) converted into the same symbols in UTF-8? Surely it is possible...

I see references to using force_encoding but this does not work:

str.force_encoding('utf-8').encode

I realize there are many other people with similar issues but I've yet to see a solution that works.

4
How are you pulling text from the remote sites? Scraping pages? Please show some sample code, including the HTTP client you are using, and whether you are parsing the pages using Nokogiri, Hpricot or REXML. This problem could be a result of how you are retrieving the page, and/or how you are parsing the page. Once we know you're pulling the content in a data-safe manner, we can help you with converting the data between code sets. - the Tin Man
Real simple code, open-uri and nokogiri, e.g. doc = Nokogiri::XML(open(url)) then doc.css(...).text to pull out the relevant blocks of text. - craic.com
Please show sample code. Is the file you are retrieving HTML or XML? Nokogiri does care about the difference when parsing. Also, provide some URLs, because every site on the internet is different. - the Tin Man
"I see references to using force_encoding but this does not work" What does "does not work" mean? Does it raise an error? Does Ruby segfault? Does your computer catch on fire? Does it replace the contents of the string with the lyrics to Yankee Doodle Dandy? Details, please! :) - Phrogz

4 Answers

72
votes

This works for me:

#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>

str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>

32
votes

There are two possibilities:

  1. The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In that case, you just need to tell Ruby that the data is already UTF-8 using force_encoding.

    For example "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data. And "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby that it is really UTF-8 and get the desired result.

  2. The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you'd have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary; it isn't a real encoding), then tell Ruby to transcode it.

    For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". This would generate such a bit of data: "\xA9".force_encoding('ISO-8859-1'). And this would demonstrate that you can get Ruby to transcode that to UTF-8: "\xA9".force_encoding('ISO-8859-1').encode('UTF-8').
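The two cases above can be sketched side by side. This is only an illustration: the variable names are made up, and the sample strings are the copyright example from the question, once as UTF-8 bytes and once as the single ISO-8859-1 byte.

```ruby
# Case 1: the bytes are already UTF-8, only the label is wrong.
# dup first, because force_encoding relabels the receiver in place.
raw_utf8   = "\xC2\xA92011 AACR".dup.force_encoding('ASCII-8BIT')
relabelled = raw_utf8.dup.force_encoding('UTF-8')   # bytes unchanged

# Case 2: the bytes are ISO-8859-1, so they must be transcoded.
raw_latin1 = "\xA92011 AACR".dup.force_encoding('ASCII-8BIT')
transcoded = raw_latin1.dup.force_encoding('ISO-8859-1').encode('UTF-8')

p relabelled.encoding, relabelled.valid_encoding?
p transcoded.encoding, transcoded.bytes.first(2)
```

In case 1 the byte sequence is left alone; in case 2 the single byte \xA9 is rewritten as the two bytes \xC2\xA9. Picking the wrong case gives either an invalid string or mojibake, so it is worth checking valid_encoding? after relabelling.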

6
votes

I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:

doc = open(DATA_URL)
doc.rewind
# doc.read keeps the original line endings; readlines.join("\n") would double them
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.read))

I believe that was Ruby 1.8.7; I'm not sure how things are with Ruby 1.9.
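On Ruby 1.9 and later, String#encode can do this kind of conversion without Iconv. A minimal sketch, assuming the page bytes really are Windows-1253; the sample bytes below are a made-up stand-in for the downloaded page, not data from the original script:

```ruby
# Hypothetical sample: the three bytes for the Greek letters alpha, beta, gamma
# in Windows-1253, tagged as binary the way freshly downloaded data often is.
raw = "\xE1\xE2\xE3".dup.force_encoding('ASCII-8BIT')

# encode(destination, source) treats the bytes as Windows-1253
# and returns a new UTF-8 string.
utf8 = raw.encode('UTF-8', 'Windows-1253')

p utf8.encoding
p utf8.valid_encoding?
```

The result can then be handed to the HTML parser in place of the Iconv.conv output.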

2
votes

I've been having issues with character encoding, and the other answers have been helpful, but didn't work for every case. Here's the solution I came up with. It forces the encoding when possible and transcodes, replacing bad characters with '?', when not:

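  def encode(str)
    # force_encoding relabels str in place; no bytes are changed
    encoded = str.force_encoding('UTF-8')
    unless encoded.valid_encoding?
      # fall back to transcoding, replacing anything that cannot be converted
      encoded = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
    end
    encoded
  end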

force_encoding works most of the time, but I've encountered some strings where that fails. Strings like this will have invalid characters replaced:

 str = "don't panic: \xD3"
 str.valid_encoding?
 #=> false
 str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
 #=> "don't panic: ?"
 str.valid_encoding?
 #=> true

Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code behaves the way you need it to. Once I come up with version 2 I'll update this answer.
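One possible alternative, sketched here under the assumption that Ruby 2.1 or later is available, is String#scrub, which replaces invalid byte sequences directly. This is a different technique from the code above, not the promised version 2, and the helper name to_utf8 is hypothetical:

```ruby
# Hypothetical helper; requires Ruby 2.1+ for String#scrub.
def to_utf8(str)
  # dup so the caller's string is not relabelled in place
  utf8 = str.dup.force_encoding('UTF-8')
  # scrub returns a copy with each invalid byte sequence replaced by '?'
  utf8.scrub('?')
end

good = to_utf8("\xC2\xA92011 AACR")   # valid UTF-8 bytes, text is kept
bad  = to_utf8("don't panic: \xD3")   # lone \xD3 is not valid UTF-8

p good
p bad
```

As with the original helper, it is worth running this against known problem text in unit tests, since replacing bytes with '?' silently loses data when the input was really in another encoding.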