3
votes

I am using Nutch 2.x, so I tried the nutch command with a depth option:

$: nutch inject ./urls/seed.txt -depth 5

After executing this command I get the message:

Unrecognized arg -depth

Since that failed, I tried the nutch crawl command instead:

$: nutch crawl ./urls/seed.txt -depth 5

which produces the error:

Command crawl is deprecated, please use bin/crawl instead

So I tried the bin/crawl script to crawl the URLs in seed.txt with a depth option, but it asks for a Solr URL, and I am not using Solr.
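For reference, this is roughly how I invoked the script (the crawl ID and Solr URL below are placeholders, and my understanding is that the number of rounds plays the role of depth):

```shell
# Nutch 2.x crawl script usage (arguments here are illustrative placeholders):
#   bin/crawl <seedDir> <crawlId> <solrURL> <numberOfRounds>
bin/crawl ./urls myCrawl http://localhost:8983/solr/ 5
```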

So my question is: how do I crawl a website with a specified depth?


1 Answer

1
votes

My first question is: what do you want to do with the crawled pages if you are not indexing them in Solr?

Answer to your question:

If you want to use the Nutch crawler without indexing into Solr, remove the indexing step from the crawl script. See:

http://technical-fundas.blogspot.com/2014/07/crawl-your-website-using-nutch-crawler.html
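As a rough sketch (the exact lines differ between Nutch versions, so treat the snippet below as illustrative rather than a verbatim copy of the script): open bin/crawl, comment out the block that pushes segments to Solr, and then pass any placeholder in the Solr URL position:

```shell
# In bin/crawl, find and comment out the Solr indexing step, which looks
# roughly like this (variable names vary by version):
#
#   echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
#   "$bin/nutch" solrindex "$SOLRURL" -all -crawlId "$CRAWL_ID"
#
# With that step disabled, the Solr URL argument is never used, so a dummy
# value is enough to satisfy the script's argument check:
bin/crawl ./urls myCrawl http://dummy:8983/solr/ 5
```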

Answer to your other question:

How to get the HTML content for all the links that have been crawled by Nutch (check this link):

How to get the html content from nutch

This will definitely resolve your issue.