0 votes

I am trying to crawl a website using Nutch. I use these commands:

  • inject to inject the seed URLs into the DB
  • a loop of generate/fetch/parse/updatedb

I noticed that Nutch re-fetches already-fetched URLs on each loop iteration.

Configuration I have made:

  • added a filter to regex-urlfilter.txt (an example entry is shown below)
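
For reference, a regex-urlfilter.txt entry typically looks like the lines below. Here example.com is a hypothetical placeholder for the site actually being crawled:

    # accept any URL on example.com (hypothetical placeholder for the target site)
    +^https?://([a-z0-9.-]*\.)?example\.com/
    # reject everything else
    -.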

Added the following properties to nutch-site.xml (a minimal sketch of the file follows this list):

  • http.agent.name set to MyNutchSpider
  • http.robots.agents set to MyNutchSpider
  • file.content.limit set to -1
  • http.content.limit set to -1
  • ftp.content.limit set to -1
  • fetcher.server.delay set to 1.0
  • fetcher.threads.fetch set to 1
  • parser.character.encoding.default
  • plugin.includes: added protocol-httpclient
  • storage.data.store.class set to use a custom storage class
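
A minimal sketch of how such properties are declared in nutch-site.xml, using the values from the list above (only a few properties shown for brevity):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchSpider</value>
      </property>
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
      <property>
        <name>fetcher.threads.fetch</name>
        <value>1</value>
      </property>
    </configuration>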

In each loop iteration I run these commands (the full cycle is sketched after this list):

  • bin/nutch generate -topN 10
  • bin/nutch fetch -all
  • bin/nutch parse -all
  • bin/nutch updatedb -all
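
Put together, the whole workflow looks roughly like this (urls/ is a hypothetical seed directory and the loop count is arbitrary):

    # inject the seed URLs once
    bin/nutch inject urls/

    # repeat the generate/fetch/parse/updatedb cycle
    for i in 1 2 3; do
      bin/nutch generate -topN 10
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb -all
    done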

I have tried Nutch 2.2.1 with MySQL and Nutch 2.3 with MongoDB. The result is the same: already-fetched URLs are re-fetched on each crawl loop iteration.

What should I do to fetch only the URLs that have not been crawled yet?


1 Answer

1 vote

This is an open issue for Nutch 2.X. I faced it this weekend too.

The fix is scheduled for release 2.3.1: https://issues.apache.org/jira/browse/NUTCH-1922.