
I am new to Nutch and Solr. I have just taken over these activities, and I now have to crawl and index my website.

These are the steps I have been asked to follow.

  • Delete the crawl folders (apache-nutch-1.10\crawl)

  • Remove the existing indexes: in Solr Admin -> Skyweb -> Documents, set Document Type to XML and execute the delete query <delete><query>*:*</query></delete>

  • Go to Solr Admin -> Core Admin, click 'Reload' and then 'Optimize'
  • Run the crawl job using the following command:

bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5
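
For reference, the whole manual procedure boils down to a handful of commands; this is a rough sketch, assuming the Nutch 1.10 layout and the core name from the crawl command above (IP, port, and paths are placeholders taken from the question):

# 1. delete the crawl folders
rm -rf apache-nutch-1.10/crawl

# 2. remove the existing indexes (the same delete query as in the Documents screen)
curl "http://IP:8080/solr/website/update?commit=true" -H 'Content-type:text/xml; charset=utf-8' --data '<delete><query>*:*</query></delete>'

# 3. reload and optimize the core over HTTP instead of the Admin UI
curl "http://IP:8080/solr/admin/cores?action=RELOAD&core=website"
curl "http://IP:8080/solr/website/update?optimize=true"

# 4. run the crawl job
bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5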

I did some research and feel that doing these tasks manually is unnecessary work; the crawl script should take care of all of the above.

So my queries/concerns are:

Doesn't the above script take care of the entire process? Do I still need to delete the crawl folders and clear the existing indexes manually?

What is the relevance of the Admin tasks - 'Reload' and 'Optimize'?

Can I cron-schedule the crawl script to run weekly, and will it take care of the entire process?

How else can I automate the crawling and indexing to run periodically?


2 Answers

3 votes

There are two possible ways:

  1. Configure Nutch to re-fetch all previously crawled pages after one week; see the property db.fetch.interval.default. Keep the crawl/ folder and the Solr index as they are; Nutch will automatically delete gone pages from Solr. You may also want to delete old segments after each crawl (rm -rf crawl/segments/*) so that the disk does not fill up over time (see the sketch after this list).

  2. Launch each crawl from scratch (just remove the folder crawl/ before calling bin/crawl). It's also possible to delete a Solr index from the command line, e.g. by firing:

curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
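
For option 1, a minimal sketch of what the recurring crawl could look like, assuming your version of bin/crawl accepts multiple -D options (otherwise set db.fetch.interval.default in conf/nutch-site.xml); the Solr URL and the 7-day interval (604800 seconds) are placeholders:

# keep crawl/ and the index; re-fetch pages whose fetch interval has expired
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/website/ -D db.fetch.interval.default=604800 urls/ crawl/ 5

# afterwards, drop old segments so the disk does not fill up over time
rm -rf crawl/segments/*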

It's not difficult to combine these commands with the call of bin/crawl in a short shell script which can be run as a cron job. Of course, it's also easy to modify the bin/crawl script itself to your own needs.
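
For example, a minimal wrapper along these lines could be scheduled via cron; this is a sketch, not a drop-in script (NUTCH_HOME, the Solr URL, and the number of rounds are assumptions to adapt):

#!/bin/bash
# recrawl.sh - weekly from-scratch crawl (option 2 above)
NUTCH_HOME=/opt/apache-nutch-1.10            # assumed install path
SOLR_URL=http://localhost:8983/solr/website  # assumed Solr core URL

cd "$NUTCH_HOME" || exit 1

# start from scratch: drop the previous crawl data
rm -rf crawl/

# clear the Solr index and commit
curl "$SOLR_URL/update" --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl "$SOLR_URL/update" --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

# crawl and index in 5 rounds
bin/crawl -i -D solr.server.url="$SOLR_URL/" urls/ crawl/ 5

A crontab entry such as 0 2 * * 0 /opt/scripts/recrawl.sh >> /var/log/recrawl.log 2>&1 would then run it every Sunday at 02:00.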

0 votes

Chillax! Just relax!! Have you ever looked into the Apache ManifoldCF project? It provides a clean interface for crawling web pages, with less hassle than Nutch. It is open source, and within a matter of minutes you can set up a job with all your parameters and index your data into the server of your choice, be it Solr, Elasticsearch, etc. Once you set up a job, you can save its settings, so you don't have to reconfigure things intermittently. It also supports a REST API that allows you to automate your jobs on the fly. Google it, you won't regret it. Hope that helps :).
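
If you try it, job control over the REST API looks roughly like this; a sketch assuming ManifoldCF's default API service location and a hypothetical job id (see the 'Programmatic Operation' page in the ManifoldCF documentation for the exact endpoints of your version):

# list the jobs defined in ManifoldCF (assumed default base URL and port)
curl http://localhost:8345/mcf-api-service/json/jobs

# kick off the job with the (hypothetical) id 1234567890
curl -X PUT http://localhost:8345/mcf-api-service/json/start/1234567890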