4 votes

I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract contains some – (dash) HTML entities:

require 'nokogiri'
require 'open-uri'
xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed

In the last line, the String should be rendered in the browser with a dash. The browser correctly renders it if I specify my page as ISO-8859-1 encoding, however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Today it is being displayed as a square with a small number inside. I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.

Any clues?

[Edit] Below are screenshots of the app:

-> Firefox with character encoding UTF-8

-> Firefox with character encoding Western (ISO-8859-1)

It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character with it just before the dash. Weird :(

Here's a tip from the investigation so far: puts [xml.encoding, f[0].text.encoding] #=> ["ISO-8859-1", #<Encoding:UTF-8>]. I'm not sure why libxml or Nokogiri is treating the text value coming from the XML as UTF-8. This occurs even if you modify the XPath to fetch the div instead of the text node, and even with an #encoding: ISO-8859-1 magic comment in the document. – Phrogz
Exactly, Phrogz. Nokogiri always delivers the node text as UTF-8, despite the fact that the document encoding is ISO-8859-1. – Felipe Lima
You can #force_encoding('ISO-8859-1') on the result of calling .text, and then cleanly convert to UTF-8... but I'm not yet convinced that your source document is valid ISO-8859-1. – Phrogz
"I'm not yet convinced that your source document is valid ISO-8859-1" I agree. The character before the dash is the smoking gun; correctly encoded HTML will not have that. I think the problem is upstream of Nokogiri and the HTTP server, either in the rendering app or in the HTML generation. Copying Word docs and pasting them into page-layout programs will do this, as will bad scraping code upstream. – the Tin Man
Yep, extremely sh*tty layout, hard to scrape. – Felipe Lima

3 Answers

9 votes

After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'

I can't see that page from here to confirm this fixes the problem, but it has worked for similar problems.
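
As a usage sketch only (reusing the XPath from the question, which I can't check against the live page), you can then verify what comes back:

f = doc.xpath("//div[@class='tit_inter_cnz']/text()")  # XPath from the question
p [ f[0].text, f[0].text.encoding, f[0].text.valid_encoding? ]
# The text should then come back as a valid UTF-8 string that Sinatra can serve as-is.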

1 vote

Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.

Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:

require 'sinatra'
require 'open-uri'

get '/' do
  html = open("http://flybynight.com.br/agenda.php").read
  p [ html.encoding, html.valid_encoding? ]
  #=> [#<Encoding:ISO-8859-1>, true]

  str  = html[ /Preview.+?John Digweed/ ]
  p [ str, str.encoding, str.valid_encoding? ]
  #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]

  utf8 = str.encode('UTF-8')
  p [ utf8, utf8.encoding, utf8.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  require 'nokogiri'
  doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
  p doc.encoding
  #=> "ISO-8859-1"

  dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
  p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  <<-ENDHTML
  <!DOCTYPE html>
  <html><head><title>Dig it!</title></head><body>
  <p>Here it comes...</p>
  <p>#{dig.text}</p>
  </body></html>
  ENDHTML
end

This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show this character in the browser, however.
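
If your Sinatra app isn't already sending that charset, you can set it explicitly with Sinatra's content_type helper; a minimal sketch:

get '/' do
  content_type 'text/html', :charset => 'utf-8'
  # ... rest of the handler as above
end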

Analyzing that response, the same UTF-8 byte pair comes back for the dash as seen above: \xC2\x96. This appears to be the Unicode character U+0096, which seems to be an odd sort of dash.

I would chalk this up to bad source data, and simply throw:

#encoding: UTF-8

at the top of your Ruby source file(s), and then put in:

f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character

Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.

Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
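
For example, a sketch of both options, reusing the dig variable from the code above and the \xC2\x96 byte pair observed in the output:

# With the #encoding: UTF-8 magic comment in place, "\xC2\x96" is the
# literal U+0096 control character seen in the scraped text.
clean = dig.text.gsub("\xC2\x96", "&ndash;")  # replace with an HTML entity
clean = dig.text.gsub("\xC2\x96", "\u2013")   # or with U+2013 EN DASH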

Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
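
A quick irb check illustrates this (the byte transcodes to codepoint U+0096, squarely in the C1 control range 0x80..0x9F):

"\x96".force_encoding('ISO-8859-1').encode('UTF-8').ord
#=> 150   (0x96, a C1 control character, not a printable glyph)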

0 votes

I work in scientific manuscript publishing, and there are many dashes. The dash you are using is not an ASCII dash; it is a Unicode dash. Forcing the ISO encoding probably has the effect of changing the dash.

http://www.fileformat.info/info/unicode/char/96/index.htm

That site is excellent for unicode issues.

The reason you are getting a square is probably that your browser does not support this character, even though it is being rendered correctly. I would keep the UTF-8 encoding, and if you want everyone to be able to see that dash, convert it to an ASCII dash.

You may want to try Iconv to convert the characters to ASCII/UTF-8: http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue
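
A hedged sketch of that Iconv route (Iconv.conv is in the 1.8/1.9 standard library; note that //TRANSLIT//IGNORE will most likely just drop or replace the control character rather than turn it into a dash, so the gsub approach above is more predictable):

require 'iconv'

# utf8_text is assumed to hold the UTF-8 string extracted with Nokogiri.
ascii = Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF-8', utf8_text)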