I am new to Nutch and Solr. I have just taken over these activities and now need to crawl and index my website.
These are the steps I have been asked to follow:
- Delete the crawl folders (apache-nutch-1.10\crawl)
- Remove the existing indexes: in Solr Admin -> Skyweb -> Documents, set Document Type to xml and execute the delete command
- Go to Solr Admin -> Core Admin -> click on 'Reload' and then 'Optimize'
- Run the crawl job using the following command:
bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5
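For context, this is roughly how I imagine the manual steps could be wrapped in a script. The Nutch install path is a placeholder, I am assuming the core in the URL above ('website') is the one to clear, and I am assuming the XML executed in the Documents tab is a delete-all query; please correct me if that is wrong.

#!/bin/bash
# Hypothetical wrapper around the manual steps (paths and core name are my assumptions)
NUTCH_HOME=/opt/apache-nutch-1.10
SOLR_URL=http://IP:8080/solr/website

# Step 1: delete the old crawl folders
rm -rf "$NUTCH_HOME/crawl"

# Step 2: clear the existing index (delete-by-query, then commit)
curl "$SOLR_URL/update?commit=true" -H "Content-Type: text/xml" \
     --data-binary '<delete><query>*:*</query></delete>'

# Step 3: optimize the index (same effect as the 'Optimize' button in the Admin UI)
curl "$SOLR_URL/update?optimize=true"

# Step 4: run the crawl with 5 rounds, indexing into Solr
cd "$NUTCH_HOME"
bin/crawl -i -D solr.server.url=$SOLR_URL/ urls/ crawl/ 5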
I did some research and feel that doing these tasks manually is unnecessary extra work, and that the crawl script should take care of all of the above.
So my queries/concerns are:
Doesn't the above script take care of the entire process? Do I still need to delete the crawl folders and clear the existing indexes manually?
What is the relevance of the Solr Admin tasks 'Reload' and 'Optimize'?
Can I schedule the crawl script with cron to run weekly (something like the crontab sketch after these questions), and will it take care of the entire process?
How else can I automate the crawling and indexing to run periodically?
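If cron is the right approach, this is the kind of crontab entry I had in mind; the schedule, install path, and log file are just placeholders I made up:

# Run the crawl every Sunday at 02:00 (example schedule; paths are assumptions)
0 2 * * 0 cd /opt/apache-nutch-1.10 && bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5 >> /var/log/nutch-crawl.log 2>&1

Would something like that be enough on its own, or do the cleanup steps still need to run before each crawl?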