Nutch2 not resuming crawl

Question

I am using the command below in Nutch 2.3.1 with MongoDB storage. While it was crawling, I interrupted the process by pressing CTRL+C. When I run the same crawl script again, it does not resume the crawl; it simply stops without any error, exiting in the second iteration.

Command used: runtime/local/bin/crawl urls/ 'crawlDb' 10

Output:

ParserJob: finished at 2018-03-02 19:48:31, time elapsed: 00:00:02
CrawlDB update for crawlDb
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1520000291-27137 -crawlId crawlDb
DbUpdaterJob: starting at 2018-03-02 19:48:31
DbUpdaterJob: batchId: 1520000291-27137
DbUpdaterJob: finished at 2018-03-02 19:48:34, time elapsed: 00:00:02
Skipping indexing tasks: no SOLR url provided.
Fri Mar 2 19:48:34 IST 2018 : Iteration 2 of 10
Generating batchId
Generating a new fetchlist
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId crawlDb -batchId 1520000314-30627
GeneratorJob: starting at 2018-03-02 19:48:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2018-03-02 19:48:37, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1520000314-30627 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Rajeevs-MacBook-Pro:apache-nutch-2.3.1 rajeevprasanna$

Sebastian Nagel · Accepted Answer · 2018-03-04T18:52:00

The reason is shown in the log: "no more URLs to fetch now". There are no new unfetched links in the web table, so the generator produces an empty batch and the crawl script leaves the loop. To start again from scratch, the CrawlDb (web table) in MongoDB needs to be removed.
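As a rough sketch of what removing the web table could look like: the script below only prints the commands rather than running them. It assumes (none of this is stated in the question) that the Gora MongoDB store names the collection `<crawlId>_webpage`, that the database is called `nutch` (check `gora.mongodb.db` in `conf/gora.properties`), and that the legacy `mongo` shell is installed. The `webpage_collection` helper is hypothetical and exists only for illustration.

```shell
#!/bin/sh
# Sketch only: prints the commands so nothing is deleted by accident.
# Assumptions (not from the original post):
#   - the web table collection is named "<crawlId>_webpage"
#   - the MongoDB database is called "nutch"
CRAWL_ID="crawlDb"
MONGO_DB="nutch"

# Hypothetical helper: derive the assumed collection name from a crawl id.
webpage_collection() {
  printf '%s_webpage\n' "$1"
}

COLLECTION=$(webpage_collection "$CRAWL_ID")

# 1. Inspect the web table first to see how many pages are fetched/unfetched.
echo "runtime/local/bin/nutch readdb -crawlId $CRAWL_ID -stats"

# 2. Drop the collection so the next crawl starts from the seed URLs again.
echo "mongo $MONGO_DB --eval 'db.getCollection(\"$COLLECTION\").drop()'"
```

After verifying the printed database and collection names against your own setup, the second command can be run by hand, followed by the original `crawl` command to inject the seeds again.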
