
I am using Nutch to crawl a list of URLs specified in the seed file, with depth 100 and topN 10,000 to ensure a full crawl. I am also trying to ignore URLs with repeated strings in their path, using a regex-urlfilter rule: http://rubular.com/r/oSkwqGHrri

However, I would like to know which URLs were ignored during the crawl. Is there any way to log the list of URLs that Nutch skipped while crawling?
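For context, the stock conf/regex-urlfilter.txt shipped with Nutch already contains a rule in this spirit (the rubular pattern above is a variant of it). A filter entry for repeated path segments looks roughly like this; treat the exact pattern as an assumption and test it against your own URLs:

```text
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
```

Lines starting with `-` reject matching URLs; lines starting with `+` accept them. Filters are applied in order, so the first matching rule wins.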


1 Answer


The links can be found with the following command:

bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv

This generates a part-00000 file in DUMP_FOLDER containing each URL together with its status.

Entries with a status of db_unfetched are the ones the crawler did not fetch.
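Once you have the CSV dump, the unfetched URLs can be pulled out with a one-line awk filter. The sketch below uses a hand-made sample file standing in for part-00000; the column layout (URL first, status name third) is an assumption, so check the header row of your actual dump before relying on it:

```shell
# Sample data in place of a real Nutch dump; the column order is assumed.
cat > part-00000 <<'EOF'
url,status code,status name
http://example.com/a,1,db_unfetched
http://example.com/b,2,db_fetched
EOF

# Print the URL (field 1) of every row whose status name (field 3) is db_unfetched.
awk -F',' '$3 == "db_unfetched" {print $1}' part-00000
```

Redirect the output to a file (e.g. `> unfetched_urls.txt`) to keep the list for later inspection.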