I am writing a screen scraper that takes a list of URLs from a post, visits each URL, and collects all of the links on the page. It then visits every link (the originals and the scraped ones) and collects a list of images. Everything works when I run the job inline, except that it takes 30 seconds to finish, which is far too long for an API call to respond. When I run the same code in a background worker, however, two URLs never update to completed. It is always the same two URLs.

What is weirder is that I am getting the error message

3 TID-ov9t89ido WARN: NoMethodError: undefined method `search' for #<Mechanize::File:0x007f9d86e77a40>

3 TID-ov9t89ido WARN: /app/app/models/scraper.rb:16:in `scrape_images'
/app/app/workers/image_worker.rb:5:in `perform'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:151:in `execute_job'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:133:in `block (2 levels) in process'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:127:in `block in invoke'
/app/vendor/bundle/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/app/vendor/bundle/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:361:in `perform_action_with_newrelic_trace'
/app/vendor/bundle/ruby/2.2.0/gems/newrelic_rpm-3.12.1.298/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/server/active_record.rb:6:in `call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/server/retry_jobs.rb:74:in `call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/server/logging.rb:11:in `block in call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/logging.rb:31:in `with_context'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/server/logging.rb:7:in `call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:129:in `block in invoke'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:132:in `call'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/middleware/chain.rb:132:in `invoke'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:128:in `block in process'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:167:in `stats'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:127:in `process'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:79:in `process_one'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/processor.rb:67:in `run'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/util.rb:16:in `watchdog'
/app/vendor/bundle/ruby/2.2.0/gems/sidekiq-4.1.1/lib/sidekiq/util.rb:24:in `block in safe_thread'

That is coming from this code:

    def self.scrape_images(uri)
      page = get_page(uri)
      base_url = page.uri.to_s
      images = page.search('//img') || []
      qualify_images(uri, images).push(base_url)
    end

I have read that Mechanize is not thread-safe, which I think could be my issue, but I don't see how that would produce this error when it works for everything else. Any help would be glorious, and thanks for reading.

1 Answer

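I am adding the answer since I didn't find one on SO when I searched. If Mechanize visits a page whose content is plain text (such as a .txt file) rather than HTML, it returns a Mechanize::File object instead of a Mechanize::Page object, and File has no search method, hence the NoMethodError. In my case I solved it with a guard clause: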

return [] unless page.is_a?(Mechanize::Page)
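
For anyone who wants to see the guard in context, here is a minimal sketch of the idea. It does not load Mechanize: FakePage and FakeFile are hypothetical stand-ins for Mechanize::Page and Mechanize::File, and image_nodes is a made-up helper name, not part of my scraper or of the gem. It also checks respond_to?(:search) instead of the class, which is an alternative to the guard above rather than what I actually used.

```ruby
# Hypothetical stand-ins for illustration only. In the real app these would be
# Mechanize::Page (HTML responses) and Mechanize::File (everything else).
FakePage = Struct.new(:uri, :nodes) do
  # Mimics the one method the scraper relies on.
  def search(_xpath)
    nodes
  end
end
FakeFile = Struct.new(:uri)

# Returns the image nodes of a fetched document, or an empty array when the
# document cannot be searched (for example a plain-text response).
def image_nodes(page)
  return [] unless page.respond_to?(:search)
  page.search('//img').to_a
end

html = FakePage.new('http://example.com/', ['<img src="a.png">'])
text = FakeFile.new('http://example.com/robots.txt')

p image_nodes(html)
p image_nodes(text)
```

Either form of the guard should keep the worker from raising on non-HTML responses, so the job can finish instead of failing on the same URLs every time.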