0 votes

I am trying to crawl a website using Nutch. I use these commands:

  • inject to inject the seed URLs into the DB
  • a loop of generate/fetch/parse/updatedb

I noticed that Nutch re-fetches already-fetched URLs on each loop iteration.

Configuration I have made:

  • added a filter to regex-urlfilter.txt (an example entry is shown below)
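
For reference, a regex-urlfilter.txt entry typically looks like the lines below. Here example.com is a hypothetical placeholder for the site actually being crawled:

    # accept any URL on example.com (hypothetical placeholder for the target site)
    +^https?://([a-z0-9.-]*\.)?example\.com/
    # reject everything else
    -.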

Added the following properties to nutch-site.xml (a minimal sketch of the file follows this list):

  • http.agent.name set to MyNutchSpider
  • http.robots.agents set to MyNutchSpider
  • file.content.limit set to -1
  • http.content.limit set to -1
  • ftp.content.limit set to -1
  • fetcher.server.delay set to 1.0
  • fetcher.threads.fetch set to 1
  • parser.character.encoding.default
  • plugin.includes: added protocol-httpclient
  • storage.data.store.class set to use a custom storage class
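
A minimal sketch of how such properties are declared in nutch-site.xml, using the values from the list above (only a few properties shown for brevity):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchSpider</value>
      </property>
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>1</value>
      </property>
    </configuration>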

In each loop iteration I run these commands (the full cycle is sketched after this list):

  • bin/nutch generate -topN 10
  • bin/nutch fetch -all
  • bin/nutch parse -all
  • bin/nutch updatedb -all
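
Put together, the whole workflow looks roughly like this (urls/ is a hypothetical seed directory and the loop count is arbitrary):

    # inject the seed URLs once
    bin/nutch inject urls/

    # repeat the generate/fetch/parse/updatedb cycle
    for i in 1 2 3; do
      bin/nutch generate -topN 10
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb -all
    done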

I have tried Nutch 2.2.1 with MySQL and Nutch 2.3 with MongoDB. The result is the same: already-fetched URLs are re-fetched on each crawl loop iteration.

What should I do to fetch only the URLs that have not been crawled yet?


1 Answer

1 vote

This is an open issue for Nutch 2.X. I faced it this weekend too.

The fix is scheduled for release 2.3.1: https://issues.apache.org/jira/browse/NUTCH-1922.