
I am using Nutch to crawl a list of URLs specified in the seed file, with depth 100 and topN 10,000 to ensure a full crawl. I am also trying to ignore URLs with repeated strings in their path, using a regex-urlfilter rule: http://rubular.com/r/oSkwqGHrri

However, I would like to know which URLs were ignored during the crawl. Is there any way to log the list of URLs that Nutch skipped while crawling?
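For context, the stock conf/regex-urlfilter.txt shipped with Nutch already contains a rule in this spirit (the rubular pattern above is a variant of it). A filter entry for repeated path segments looks roughly like this; treat the exact pattern as an assumption and test it against your own URLs:

```text
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
```

Lines starting with `-` reject matching URLs; lines starting with `+` accept them. Filters are applied in order, so the first matching rule wins.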


1 Answer


The links can be found with the following command:

bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv

This generates a part-00000 file in DUMP_FOLDER containing each URL together with its status.

Entries with a status of db_unfetched are the ones the crawler did not fetch.
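Once you have the CSV dump, the unfetched URLs can be pulled out with a one-line awk filter. The sketch below uses a hand-made sample file standing in for part-00000; the column layout (URL first, status name third) is an assumption, so check the header row of your actual dump before relying on it:

```shell
# Sample data in place of a real Nutch dump; the column order is assumed.
cat > part-00000 <<'EOF'
url,status code,status name
http://example.com/a,1,db_unfetched
http://example.com/b,2,db_fetched
EOF

# Print the URL (field 1) of every row whose status name (field 3) is db_unfetched.
awk -F',' '$3 == "db_unfetched" {print $1}' part-00000
```

Redirect the output to a file (e.g. `> unfetched_urls.txt`) to keep the list for later inspection.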