I am using Nutch 2.3.1.
I run the following commands to crawl a site:
- ./nutch inject ../urls/seed.txt
- ./nutch generate -topN 2500
- ./nutch fetch -all
The problem is that Nutch only crawls the first URL (the one specified in seed.txt). The only data stored is the HTML from that first URL/page.
None of the other URLs that were accumulated by the generate command are actually crawled.
I cannot get Nutch to crawl the other generated URLs, and I also cannot get it to crawl the entire website. Which options do I need to use to crawl an entire site?
Does anyone have any insights or recommendations?
Thank you so much for your help.
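
For reference, the Nutch 2.x tutorial describes one crawl round as generate, fetch, parse, and updatedb, and newly discovered links normally become eligible for fetching only after updatedb has run. Below is a minimal sketch of a multi-round loop built on the commands above. The `parse -all` and `updatedb -all` steps, the `ROUNDS` count, and the `DRY_RUN` switch are assumptions added for illustration and are not part of the commands I ran; by default the script only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch of a multi-round Nutch 2.x crawl cycle.
# Assumptions (not in the original commands): the parse and updatedb steps,
# the ROUNDS count, and the DRY_RUN switch.

NUTCH="${NUTCH:-./nutch}"   # path to the nutch launcher, as used above
ROUNDS="${ROUNDS:-3}"       # hypothetical number of crawl rounds
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

crawl_rounds() {
  # Seed the web table once.
  run "$NUTCH" inject ../urls/seed.txt

  # Each round: select URLs, fetch them, parse out links, update the web table.
  for ((i = 1; i <= ROUNDS; i++)); do
    run "$NUTCH" generate -topN 2500
    run "$NUTCH" fetch -all
    run "$NUTCH" parse -all
    run "$NUTCH" updatedb -all
  done
}

crawl_rounds
```

Running it with `DRY_RUN=0 ROUNDS=5` would execute five rounds instead of printing them. How many rounds are needed to cover a whole site depends on its link depth, so the value here is only a placeholder.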