I am trying to crawl a website using Nutch. The commands I use are:
- inject, to add the seed URLs to the DB (see the sketch right after this list)
- a loop of generate/fetch/parse/updatedb
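For the inject step I run something like this (urls/ is a local directory containing my seed.txt; the path is just how my setup is laid out):

    # add the seed URLs from urls/seed.txt to the crawl DB
    bin/nutch inject urls/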
I noticed that Nutch re-fetches URLs that have already been fetched on each loop iteration.
The configuration I have made:
- added a filter to regex-urlfilter.txt, roughly as shown below
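The filter is along these lines (example.com stands in for my real domain; the first rule is the stock one for skipping query-like URLs):

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]
    # accept only pages under my site (example.com is a placeholder)
    +^https?://([a-z0-9-]+\.)*example\.com/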
In nutch-site.xml I added the following settings (a trimmed sketch of the file is after this list):
- http.agent.name set to MyNutchSpider
- http.robots.agents set to MyNutchSpider
- file.content.limit set to -1
- http.content.limit set to -1
- ftp.content.limit set to -1
- fetcher.server.delay set to 1.0
- fetcher.threads.fetch set to 1
- parser.character.encoding.default
- protocol-httpclient added to plugin.includes
- storage.data.store.class set to a custom storage class
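For example, the agent name, content limit, and fetch delay entries look roughly like this (trimmed; the custom storage class entry is left out):

    <property>
      <name>http.agent.name</name>
      <value>MyNutchSpider</value>
    </property>
    <property>
      <name>http.content.limit</name>
      <!-- -1 disables the download size limit -->
      <value>-1</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
    </property>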
The commands I run inside the loop are:
- bin/nutch generate -topN 10
- bin/nutch fetch -all
- bin/nutch parse -all
- bin/nutch updatedb -all
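So one iteration of my crawl loop is essentially:

    # one iteration of the crawl loop, repeated until nothing new turns up
    bin/nutch generate -topN 10   # mark up to 10 URLs for fetching
    bin/nutch fetch -all          # fetch the marked URLs
    bin/nutch parse -all          # parse the fetched pages
    bin/nutch updatedb -all       # update the DB with newly discovered links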
I have tried Nutch 2.2.1 with MySQL and Nutch 2.3 with MongoDB. The result is the same: already-fetched URLs are re-fetched on each crawl loop iteration.
What should I do so that only URLs that have not yet been crawled are fetched?