4 votes

I'm parsing an external HTML page with Nokogiri. That page is encoded with ISO-8859-1. Part of the data I want to extract contains some – (dash) HTML entities:

require 'nokogiri'
require 'open-uri'
xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed

In the last line, the String should be rendered in the browser with a dash. The browser correctly renders it if I specify my page as ISO-8859-1 encoding, however, my Sinatra app uses UTF-8. How can I correctly display that text in the browser? Today it is being displayed as a square with a small number inside. I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.

Any clues?

[Edit] Below are screenshots of the app:

-> Firefox with character encoding UTF-8

-> Firefox with character encoding Western (ISO-8859-1)

It's worth mentioning that in the ISO-8859-1 mode above, the dash is shown correctly, but there is another incorrect character with it just before the dash. Weird :(

Here's a tip from the investigation so far: puts [xml.encoding, f[0].text.encoding] #=> ["ISO-8859-1", #<Encoding:UTF-8>]. I'm not sure why libxml or Nokogiri is treating the text value coming from the XML as UTF-8. This occurs even if you modify the XPath to fetch the div instead of the text node, and even with an #encoding: ISO-8859-1 magic comment in the document. – Phrogz
Exactly, Phrogz. Nokogiri always delivers the node text as UTF-8, despite the fact that the document encoding is ISO-8859-1. – Felipe Lima
You can #force_encoding('ISO-8859-1') on the result of calling .text, and then cleanly convert to UTF-8... but I'm not yet convinced that your source document is valid ISO-8859-1. – Phrogz
"I'm not yet convinced that your source document is valid ISO-8859-1" I agree. The character before the dash is the smoking gun; correctly encoded HTML will not have that. I think the problem is upstream of Nokogiri and the HTTP server, either in the rendering app or in the HTML generation. Copying Word docs and pasting them into page-layout programs will do this, as will bad scraping code upstream. – the Tin Man
Yep, extremely sh*tty layout, hard to scrape. – Felipe Lima

3 Answers

9 votes

After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'

I can't see that page from here to confirm this fixes the problem, but it has worked for similar problems.
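
As a usage sketch only (reusing the XPath from the question, which I can't check against the live page), you can then verify what comes back:

f = doc.xpath("//div[@class='tit_inter_cnz']/text()")  # XPath from the question
p [ f[0].text, f[0].text.encoding, f[0].text.valid_encoding? ]
# The text should then come back as a valid UTF-8 string that Sinatra can serve as-is.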

1 vote

Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.

Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:

require 'sinatra'
require 'open-uri'

get '/' do
  html = open("http://flybynight.com.br/agenda.php").read
  p [ html.encoding, html.valid_encoding? ]
  #=> [#<Encoding:ISO-8859-1>, true]

  str  = html[ /Preview.+?John Digweed/ ]
  p [ str, str.encoding, str.valid_encoding? ]
  #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]

  utf8 = str.encode('UTF-8')
  p [ utf8, utf8.encoding, utf8.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  require 'nokogiri'
  doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
  p doc.encoding
  #=> "ISO-8859-1"

  dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
  p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  <<-ENDHTML
  <!DOCTYPE html>
  <html><head><title>Dig it!</title></head><body>
  <p>Here it comes...</p>
  <p>#{dig.text}</p>
  </body></html>
  ENDHTML
end

This properly serves up content with Content-Type:text/html;charset=utf-8 on my computer. Chrome does not show this character in the browser, however.
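
If your Sinatra app isn't already sending that charset, you can set it explicitly with Sinatra's content_type helper; a minimal sketch:

get '/' do
  content_type 'text/html', :charset => 'utf-8'
  # ... rest of the handler as above
end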

Analyzing that response, the same UTF-8 byte pair comes back for the dash as seen above: \xC2\x96. This appears to be the Unicode character U+0096, which seems to be an odd sort of dash.

I would chalk this up to bad source data, and simply throw:

#encoding: UTF-8

at the top of your Ruby source file(s), and then put in:

f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character

Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.

Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
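
For example, a sketch of both options, reusing the dig variable from the code above and the \xC2\x96 byte pair observed in the output:

# With the #encoding: UTF-8 magic comment in place, "\xC2\x96" is the
# literal U+0096 control character seen in the scraped text.
clean = dig.text.gsub("\xC2\x96", "&ndash;")  # replace with an HTML entity
clean = dig.text.gsub("\xC2\x96", "\u2013")   # or with U+2013 EN DASH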

Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
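
A quick irb check illustrates this (the byte transcodes to codepoint U+0096, squarely in the C1 control range 0x80..0x9F):

"\x96".force_encoding('ISO-8859-1').encode('UTF-8').ord
#=> 150   (0x96, a C1 control character, not a printable glyph)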

0 votes

I work in scientific manuscript publishing, and there are many dashes. The dash you are using is not an ASCII dash; it is a Unicode dash. Forcing the ISO encoding probably has the effect of changing the dash.

http://www.fileformat.info/info/unicode/char/96/index.htm

That site is excellent for unicode issues.

The reason you are getting a square is probably that your browser does not support this character, even though it is being rendered correctly. I would keep the UTF-8 encoding, and if you want everyone to be able to see that dash, convert it to an ASCII dash.

You may want to try Iconv to convert the characters to ASCII/UTF-8: http://craigjolicoeur.com/blog/ruby-iconv-to-the-rescue
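
A hedged sketch of that Iconv route (Iconv.conv is in the 1.8/1.9 standard library; note that //TRANSLIT//IGNORE will most likely just drop or replace the control character rather than turn it into a dash, so the gsub approach above is more predictable):

require 'iconv'

# utf8_text is assumed to hold the UTF-8 string extracted with Nokogiri.
ascii = Iconv.conv('ASCII//TRANSLIT//IGNORE', 'UTF-8', utf8_text)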