
I am new to Nutch and Solr. I have just taken over these activities, and I now have to crawl and index my website.

These are the steps I have been asked to follow.

  • Delete the crawl folders (apache-nutch-1.10\crawl)

  • Remove the existing indexes: in Solr Admin -> Skyweb -> Documents, set Document Type to XML and execute the delete query <delete><query>*:*</query></delete>

  • Go to Solr Admin -> Core Admin, click 'Reload' and then 'Optimize'
  • Run the crawl job using the following command:

bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5
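
For reference, the whole manual procedure boils down to a handful of commands; this is a rough sketch, assuming the Nutch 1.10 layout and the core name from the crawl command above (IP, port, and paths are placeholders taken from the question):

# 1. delete the crawl folders
rm -rf apache-nutch-1.10/crawl

# 2. remove the existing indexes (the same delete query as in the Documents screen)
curl "http://IP:8080/solr/website/update?commit=true" -H 'Content-type:text/xml; charset=utf-8' --data '<delete><query>*:*</query></delete>'

# 3. reload and optimize the core over HTTP instead of the Admin UI
curl "http://IP:8080/solr/admin/cores?action=RELOAD&core=website"
curl "http://IP:8080/solr/website/update?optimize=true"

# 4. run the crawl job
bin/crawl -i -D solr.server.url=http://IP:8080/solr/website/ urls/ crawl/ 5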

I did some research and feel that doing these tasks manually is unnecessary work; the crawl script should take care of all of the above.

So my queries/concerns are:

Doesn't the above script take care of the entire process? Do I still need to delete the crawl folders and clear the existing indexes manually?

What is the relevance of the Admin tasks - 'Reload' and 'Optimize'?

Can I cron-schedule the crawl script to run weekly, and will it take care of the entire process?

How else can I automate the crawling and indexing to run periodically?


2 Answers

3 votes

There are two possible ways:

  1. Configure Nutch to re-fetch all previously crawled pages after one week; see the property db.fetch.interval.default. Keep the crawl/ folder and the Solr index as they are; Nutch will automatically delete gone pages from Solr. You may also want to delete old segments after each crawl (rm -rf crawl/segments/*) so that the disk does not fill up over time (see the sketch after this list).

  2. Launch each crawl from scratch (just remove the folder crawl/ before calling bin/crawl). It's also possible to delete a Solr index from the command line, e.g. by firing:

curl http://localhost:8983/solr/update --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl http://localhost:8983/solr/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
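
For option 1, a minimal sketch of what the recurring crawl could look like, assuming your version of bin/crawl accepts multiple -D options (otherwise set db.fetch.interval.default in conf/nutch-site.xml); the Solr URL and the 7-day interval (604800 seconds) are placeholders:

# keep crawl/ and the index; re-fetch pages whose fetch interval has expired
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/website/ -D db.fetch.interval.default=604800 urls/ crawl/ 5

# afterwards, drop old segments so the disk does not fill up over time
rm -rf crawl/segments/*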

It's not difficult to combine these commands with the call of bin/crawl in a short shell script which can be run as a cron job. Of course, it's also easy to modify the bin/crawl script itself to your own needs.
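
For example, a minimal wrapper along these lines could be scheduled via cron; this is a sketch, not a drop-in script (NUTCH_HOME, the Solr URL, and the number of rounds are assumptions to adapt):

#!/bin/bash
# recrawl.sh - weekly from-scratch crawl (option 2 above)
NUTCH_HOME=/opt/apache-nutch-1.10            # assumed install path
SOLR_URL=http://localhost:8983/solr/website  # assumed Solr core URL

cd "$NUTCH_HOME" || exit 1

# start from scratch: drop the previous crawl data
rm -rf crawl/

# clear the Solr index and commit
curl "$SOLR_URL/update" --data '<delete><query>*:*</query></delete>' -H 'Content-type:text/xml; charset=utf-8'
curl "$SOLR_URL/update" --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

# crawl and index in 5 rounds
bin/crawl -i -D solr.server.url="$SOLR_URL/" urls/ crawl/ 5

A crontab entry such as 0 2 * * 0 /opt/scripts/recrawl.sh >> /var/log/recrawl.log 2>&1 would then run it every Sunday at 02:00.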

0 votes

Chillax! Just relax!! Have you ever looked into the Apache ManifoldCF project? It provides a clean interface for crawling web pages, with less hassle than Nutch. It is open source, and within a matter of minutes you can set up a job with all your parameters and index your data into the server of your choice, be it Solr, Elasticsearch, etc. Once you set up a job, you can save its settings, so you don't have to reconfigure things intermittently. It also supports a REST API that allows you to automate your jobs on the fly. Google it, you won't regret it. Hope that helps :).
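
If you try it, job control over the REST API looks roughly like this; a sketch assuming ManifoldCF's default API service location and a hypothetical job id (see the 'Programmatic Operation' page in the ManifoldCF documentation for the exact endpoints of your version):

# list the jobs defined in ManifoldCF (assumed default base URL and port)
curl http://localhost:8345/mcf-api-service/json/jobs

# kick off the job with the (hypothetical) id 1234567890
curl -X PUT http://localhost:8345/mcf-api-service/json/start/1234567890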