Summary: The problematic characters are control characters from ISO-8859-1, not intended for display.
Details and Investigation:
Here's a test showing that you are getting valid UTF-8 from Nokogiri and Sinatra:
require 'sinatra'
require 'open-uri'
require 'nokogiri'

get '/' do
  # URI.open is the open-uri entry point on current Rubies;
  # older versions also accepted a bare open(url).
  html = URI.open("http://flybynight.com.br/agenda.php").read
  p [ html.encoding, html.valid_encoding? ]
  #=> [#<Encoding:ISO-8859-1>, true]

  # Grab the problematic snippet straight from the raw source
  str = html[ /Preview.+?John Digweed/ ]
  p [ str, str.encoding, str.valid_encoding? ]
  #=> ["Preview M/E/C/A \x96 John Digweed", #<Encoding:ISO-8859-1>, true]

  # Transcode it to UTF-8 by hand
  utf8 = str.encode('UTF-8')
  p [ utf8, utf8.encoding, utf8.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  # Parse the same source with Nokogiri, stating the encoding explicitly
  doc = Nokogiri.HTML(html, nil, 'ISO-8859-1')
  p doc.encoding
  #=> "ISO-8859-1"

  dig = doc.xpath("//div[@class='tit_inter_cnz']")[1]
  p [ dig.text, dig.text.encoding, dig.text.valid_encoding? ]
  #=> ["Preview M/E/C/A \xC2\x96 John Digweed", #<Encoding:UTF-8>, true]

  <<-ENDHTML
<!DOCTYPE html>
<html><head><title>Dig it!</title></head><body>
<p>Here it comes...</p>
<p>#{dig.text}</p>
</body></html>
  ENDHTML
end
This properly serves up content with Content-Type: text/html;charset=utf-8 on my computer. Chrome does not show me this character in the browser, however.
Analyzing that response, the same UTF-8 byte pair comes back for the dash as seen above: \xC2\x96. This appears to be the Unicode character U+0096, which is a control character rather than a real dash.
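As a quick check of which character that byte pair encodes, here is a minimal sketch that decodes the two bytes as UTF-8 and inspects the result. The variable names are only illustrative.

```ruby
# Build the two bytes seen in the response and treat them as UTF-8.
char = [0xC2, 0x96].pack('C*').force_encoding('UTF-8')

# The pair decodes to a single code point.
codepoint  = char.codepoints.first
label      = format('U+%04X', codepoint)

# \p{Cntrl} matches control characters, which includes the C1 range U+0080..U+009F.
is_control = /\A\p{Cntrl}\z/.match?(char)

puts label
puts is_control
```

If this prints U+0096 and true, the character is a control code and not something a browser is expected to render.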
I would chalk this up to bad source data, and simply throw:
#encoding: UTF-8
at the top of your Ruby source file(s), and then put in:
f = ...text.gsub( "\xC2\x96", "-" ) # Or a better Unicode character
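As a sketch of the "better Unicode character" option, the replacement could be an en dash (U+2013). The helper name fix_dash and the sample string are assumptions for illustration, not part of the original code.

```ruby
# encoding: UTF-8

# Hypothetical helper: swap the control character U+0096 for a real en dash (U+2013).
def fix_dash(text)
  text.gsub("\u0096", "\u2013")
end

raw   = "Preview M/E/C/A \u0096 John Digweed"
fixed = fix_dash(raw)

puts fixed
```

Using \u escapes instead of raw \x bytes keeps the literal valid regardless of how the editor saves the file.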
Edit: If you look at the browser test page for that character you will see (at least in Chrome and Firefox for me) that the UTF-8 literal version is blank, but the hex and decimal escape versions show up. I cannot fathom why this is, but there you have it. The browsers are simply not displaying your character correctly when presented in raw form.
Either make it an HTML entity, or a different Unicode dash. Either way a gsub is called for.
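For the HTML entity route, a minimal sketch might look like the following. The helper name dash_to_entity is hypothetical, and &#8211; (the en dash) is just one reasonable choice of entity. As far as I know, the HTML parsing rules remap numeric references in the 0x80 to 0x9F range to their Windows-1252 meanings, which would explain why the escaped versions render while the raw character does not, but treat that as a likely explanation and not a confirmed one.

```ruby
# encoding: UTF-8

# Hypothetical helper: emit an explicit en-dash entity in place of the raw U+0096.
def dash_to_entity(text)
  text.gsub("\u0096", "&#8211;")
end

title   = "Preview M/E/C/A \u0096 John Digweed"
snippet = "<p>#{dash_to_entity(title)}</p>"

puts snippet
```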
Edit #2: One more odd note: the character in the source encoding has a hexadecimal byte value of 0x96. As far as I can tell, this does not appear to be a printable ISO-8859-1 character. As shown in the official spec for ISO-8859-1, this falls in one of the two non-printing regions.
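One possible explanation, offered as an assumption and not something confirmed about this site, is that the page is really Windows-1252 mislabeled as ISO-8859-1. In Windows-1252 the byte 0x96 is an en dash. The sketch below decodes the same byte both ways to show the difference.

```ruby
# A single raw byte with value 0x96.
byte = [0x96].pack('C')

# Interpreted as ISO-8859-1, it becomes the control character U+0096.
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Interpreted as Windows-1252, it becomes the en dash U+2013.
as_cp1252 = byte.dup.force_encoding('Windows-1252').encode('UTF-8')

puts format('ISO-8859-1:   U+%04X', as_latin1.ord)
puts format('Windows-1252: U+%04X', as_cp1252.ord)
```

If that assumption holds for your data, reading the source as Windows-1252 before handing it to the parser would avoid the need for a gsub at all.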
puts [xml.encoding, f[0].text.encoding] #=> ["ISO-8859-1", #<Encoding:UTF-8>]
I'm not sure why libxml or Nokogiri is treating the text value coming from the XML as UTF-8. This occurs even if you modify the XPath to fetch the div instead of the text node. This occurs even with an #encoding: ISO-8859-1 magic comment in the document. – Phrogz
... #force_encoding('ISO-8859-1') on the result of calling .text, and then cleanly convert to UTF-8... but I'm not yet convinced that your source document is valid ISO-8859-1. – Phrogz