2 votes

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites, and Nutch crawled them successfully and stored the data in MySQL. I am using Solr 4.0.0 for searching.

Now my problem is that when I try to re-crawl a site like trailer.apple.com, or any other site, it always crawls the last crawled URLs. I have even removed the last crawled URLs from the seeds.txt file and entered new URLs, but Nutch does not crawl the new URLs.

Can anybody tell me what I am actually doing wrong?

Also, please suggest a Nutch plugin that can help with crawling video and movie sites.

Any help would be really appreciated.


3 Answers

2 votes

I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.

The first time I started Nutch, I did the following:

  • Added the domain "www.domain01.com" (without quotes) to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt

  • In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

  • In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

... and everything was fine.

Next, I made the following changes:

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and added two new domains: www.domain02.com and www.domain03.com

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain02.com/sport/
    ^http://([a-z0-9]*.)*www.domain03.com/sport/

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain02.com/sport/
    ^http://([a-z0-9]*.)*www.domain03.com/sport/

Next, I executed the following commands:

updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3

And Nutch still crawls www.domain01.com.

I don't know why.

I use Nutch 2.1 on Linux Debian 6.0.5 (x64), and Linux is running in a virtual machine on Windows 7 (x64).
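
To check which URLs actually ended up in the crawl storage after these commands, something like the following might help (a sketch only; the readdb command and its -stats/-dump options are assumed from Nutch 2.x's WebTableReader, so verify the exact usage with bin/nutch readdb on your build):

    # Sketch: inspect the stored web table (options assumed from Nutch 2.x, verify locally).
    bin/nutch readdb -stats          # per-status counts of the stored URLs
    bin/nutch readdb -dump webdump   # dump the entries into the webdump/ directory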

1 vote

This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is meant to be just a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject them).
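
For instance, a minimal sketch of such an override in conf/nutch-site.xml could look like this (the one-day value is only an illustration, not a recommendation):

    <!-- Sketch only: force a shorter default re-fetch interval. -->
    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
      <description>Default number of seconds between re-fetches of a page (here: 1 day).</description>
    </property>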

Another configuration that may help is your regex-urlfilter.txt, if you want to point the crawl at a specific place or exclude certain domains/pages, etc.
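
As a rough sketch of those rules (example.com is a placeholder, not a domain from the question): lines starting with - reject matching URLs, lines starting with + accept them, and the first rule that matches wins:

    # Sketch only: example.com is a placeholder domain.
    # Reject a section you do not want crawled
    -^http://www\.example\.com/private/
    # Accept everything else on the domain
    +^http://([a-z0-9-]+\.)*example\.com/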

Cheers.

0 votes

Just add the property tag below to your nutch-site.xml. It works for me; check it out:

    <property>
      <name>file.crawl.parent</name>
      <value>false</value>
    </property>

Then change regex-urlfilter.txt:

    # skip file: ftp: and mailto: urls
    #-^(file|ftp|mailto):

    # accept anything else
    +.

After that, remove the indexing directory, either manually or with a command such as:

    rm -r $NUTCH_HOME/indexdir

Then run your crawl command again.
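
Putting the steps together as a rough shell sketch (it reuses the inject/crawl commands and the indexdir path already shown in this thread; adjust names and paths to your own setup):

    # Sketch: drop the old index, then re-run the crawl with the new seeds.
    rm -r $NUTCH_HOME/indexdir            # remove the old index directory
    cd $NUTCH_HOME
    bin/nutch inject urls                 # inject the new URLs from urls/seed.txt
    bin/nutch crawl urls -depth 3         # re-crawl, same command as used earlier in this thread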