Nutch not crawling entire website

Question

I am using Nutch 2.3.1.

I run the following commands to crawl a site:

./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all

The problem is that Nutch only crawls the first URL (the one specified in seed.txt). The only data stored is the HTML from that first page.

None of the other URLs accumulated by the generate command are actually crawled.

I cannot get Nutch to crawl the other generated URLs, and I cannot get it to crawl the entire website. Which options do I need to use to crawl an entire site?

Does anyone have any insights or recommendations?

Thank you so much for your help

Do Do Do Do · Accepted Answer · 2016-03-10T19:07:14

If Nutch crawls only the one specified URL, check the Nutch URL filter (conf/regex-urlfilter.txt). To crawl all URLs reachable from the seed, the content of regex-urlfilter.txt should be as follows:

# accept all URLs
+.
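
Rules in this file are applied top to bottom and the first matching pattern decides whether a URL is kept, so a reject rule placed above `+.` can still drop links. A small sketch for listing only the active rules in order; the helper name `active_rules` is made up for illustration, and the path assumes you run it from the Nutch runtime directory:

```shell
# Hypothetical helper: print the active rules of a Nutch regex URL
# filter file in order, skipping comment lines and blank lines.
active_rules() {
    grep -vE '^[[:space:]]*(#|$)' "$1"
}
```

For example, `active_rules conf/regex-urlfilter.txt`. If the output shows a rule such as `-[?*!@=]` (present in the stock file, as far as I know) above `+.`, URLs containing query strings are rejected before the accept-all rule is reached.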

See details here: http://wiki.apache.org/nutch/NutchTutorial
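
Separately from the filter, a single inject/generate/fetch pass only fetches the seed itself. In Nutch 2.x the outlinks are extracted by the parse step and written back by updatedb, and only after that can the next generate select them. A minimal sketch of that cycle, assuming the `./nutch` launcher from the question; the function name `crawl_rounds` and the `NUTCH` variable are made up for illustration, and the `-all` arguments to parse and updatedb should be checked against the usage output of your version:

```shell
# Launcher to call; defaults to ./nutch as used in the question.
NUTCH="${NUTCH:-./nutch}"

# Hypothetical helper: inject the seeds once, then repeat
# generate -> fetch -> parse -> updatedb for a number of rounds.
# Usage: crawl_rounds <seed file or dir> <rounds> <topN>
crawl_rounds() {
    seed="$1"
    rounds="$2"
    topn="$3"

    "$NUTCH" inject "$seed" || return 1

    i=1
    while [ "$i" -le "$rounds" ]; do
        # Select up to topN URLs that are due for fetching.
        "$NUTCH" generate -topN "$topn" || return 1
        # Download the selected pages.
        "$NUTCH" fetch -all || return 1
        # Extract text and outlinks from the fetched pages.
        "$NUTCH" parse -all || return 1
        # Store the new outlinks so the next round can generate them.
        "$NUTCH" updatedb -all || return 1
        i=$((i + 1))
    done
}
```

Calling `crawl_rounds ../urls/seed.txt 3 2500` would run three rounds, each reaching one link level deeper than the last. Nutch also ships a `bin/crawl` script that wraps the same cycle.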

Hope this helps,

Le Quoc Do
