
I am working through one of the examples on the Mechanize documentation site, and I want to parse the results using Nokogiri.

My problem is that when the following line gets executed:

doc = Nokogiri::HTML(search_results, 'UTF-8')

the following error occurs:

C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
    from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
    from mechanize_test.rb:16:in `<main>'

I have Ruby 1.9 installed on a Windows Vista machine.

The results returned by Mechanize are non-Latin (UTF-8).

The code sample follows.

# encoding: UTF-8

 require 'rubygems'
 require 'mechanize'
 require 'nokogiri'

 agent = Mechanize.new
 agent.user_agent_alias = 'Mac Safari'
 page = agent.get("http://www.google.com/")
 search_form = page.form_with(:name => "f")
 search_form.field_with(:name => "q").value = "invitations"
 search_results = agent.submit(search_form)
 puts search_results.body

 doc = Nokogiri::HTML(search_results, 'UTF-8')

2 Answers

5
votes

@Douglas Drouillard

Thanks for looking into this. I found out I had made a mistake. The call to Nokogiri should have been:

doc = Nokogiri::HTML(search_results.body, 'UTF-8')

Note that search_results is different from search_results.body.

search_results is the object that comes straight out of Mechanize, while search_results.body contains the UTF-8 HTML, which Nokogiri can parse with no problem.

2
votes

This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.

Here is a parsing example from the Nokogiri project page that specifies an encoding:

Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')

Notice that the encoding is the third parameter, not the second. That still does not fully explain the behavior you are seeing, though, since a misplaced encoding should simply be ignored rather than raise an error.

Per the Nokogiri documentation, a call to Nokogiri::HTML() is a convenience method for the parse method.

Code for Nokogiri::HTML::parse

   def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
      Document.parse(thing, url, encoding, options, &block)
   end
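To make the parameter-slot point concrete, here is a minimal sketch. Note that parse_sketch is a hypothetical stand-in I made up for illustration: it copies only the order of the first three parameters from the signature quoted above and is not Nokogiri code.

```ruby
# Hypothetical stand-in: same parameter order as the quoted signature
# (thing, url, encoding). It only reports which slot each argument landed in.
def parse_sketch(thing, url = nil, encoding = nil)
  { thing: thing, url: url, encoding: encoding }
end

# Two positional arguments: 'UTF-8' is bound to url, encoding stays nil.
two_args = parse_sketch('<html></html>', 'UTF-8')

# Three positional arguments with nil for url: 'UTF-8' reaches encoding.
three_args = parse_sketch('<html></html>', nil, 'UTF-8')

p two_args
p three_args
```

So with the call from the question, the string 'UTF-8' would be treated as the url and the encoding would be left unset.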

The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:

 if string_or_io.respond_to?(:encoding)
   unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
   end
 end

Notice string_or_io.encoding.name. This matches the error you saw: undefined method 'name' for "UTF-8":String (NoMethodError).
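Here is a rough sketch of how that check could fail, using only plain Ruby. FakePage and detect_encoding are hypothetical names of mine, not Mechanize or Nokogiri APIs. The sketch assumes, as the error message suggests, that the object you passed in responds to encoding with a plain String, whereas a Ruby 1.9 String responds with an Encoding object.

```ruby
# Hypothetical stand-in for the object passed to Nokogiri::HTML.
# Assumption: its #encoding returns a String such as "UTF-8".
FakePage = Struct.new(:body, :encoding)

# Mirrors only the encoding check quoted above, not the whole parse method.
def detect_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

html = "<html><body>caf\u00e9</body></html>" # a UTF-8 String
page = FakePage.new(html, "UTF-8")

# String#encoding returns an Encoding object, which does have #name.
from_body = detect_encoding(page.body)

# FakePage#encoding returns a String, which has no #name method.
from_page =
  begin
    detect_encoding(page)
  rescue NoMethodError => e
    e
  end

puts from_body
puts from_page.class
```

If that assumption holds for your object, passing an actual String of HTML instead of the wrapper object would get past this check.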

Does your search_results object have an encoding attribute whose value is the string 'UTF-8'? It appears Nokogiri expects encoding to return an object that in turn has a name attribute of 'UTF-8'.