I am running the command below in Nutch 2.3.1 with MongoDB storage. While it was crawling, I interrupted the process by pressing CTRL+C. Now, when I run the same crawl script again, it stops without reporting any error: it exits in the second iteration.
Command used: runtime/local/bin/crawl urls/ 'crawlDb' 10
Output:
ParserJob: finished at 2018-03-02 19:48:31, time elapsed: 00:00:02
CrawlDB update for crawlDb
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1520000291-27137 -crawlId crawlDb
DbUpdaterJob: starting at 2018-03-02 19:48:31
DbUpdaterJob: batchId: 1520000291-27137
DbUpdaterJob: finished at 2018-03-02 19:48:34, time elapsed: 00:00:02
Skipping indexing tasks: no SOLR url provided.
Fri Mar 2 19:48:34 IST 2018 : Iteration 2 of 10
Generating batchId
Generating a new fetchlist
/Users/rajeevprasanna/Desktop/nutch-cassandra/apache-nutch-2.3.1/runtime/local/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -topN 50000 -noNorm -noFilter -adddays 0 -crawlId crawlDb -batchId 1520000314-30627
GeneratorJob: starting at 2018-03-02 19:48:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2018-03-02 19:48:37, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1520000314-30627 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Rajeevs-MacBook-Pro:apache-nutch-2.3.1 rajeevprasanna$