2 votes

I am using Nutch 2.1 integrated with MySQL. I crawled 2 sites, and Nutch crawled them successfully and stored the data in MySQL. I am using Solr 4.0.0 for searching.

Now my problem is that when I try to re-crawl a site like trailer.apple.com, or any other site, it always crawls the last crawled URLs. I have even removed the last crawled URLs from the seeds.txt file and entered new URLs, but Nutch does not crawl the new URLs.

Can anybody tell me what I am actually doing wrong?

Also, please suggest a Nutch plugin that can help with crawling video and movie sites.

Any help would be really appreciated.


3 Answers

2 votes

I have the same problem. Nutch re-crawls only the old URLs, even though they no longer exist in seed.txt.

The first time I started Nutch, I did the following:

  • Added the domain "www.domain01.com" (without quotes) to /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt

  • In /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

  • In /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt, added a new line:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain01.com/sport/

... and everything was fine.

Next, I made the following changes:

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/urls/seed.txt and added two new domains: www.domain02.com and www.domain03.com

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/runtime/local/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain02.com/sport/
    ^http://([a-z0-9]*.)*www.domain03.com/sport/

  • Removed www.domain01.com from /root/Desktop/apache-nutch-2.1/conf/regex-urlfilter.txt and added two new lines:

    # accept anything else
    ^http://([a-z0-9]*.)*www.domain02.com/sport/
    ^http://([a-z0-9]*.)*www.domain03.com/sport/

Next, I executed the following commands:

updatedb
bin/nutch inject urls
bin/nutch generate urls
bin/nutch updatedb
bin/nutch crawl urls -depth 3

And Nutch still crawls www.domain01.com.

I don't know why.

I use Nutch 2.1 on Linux Debian 6.0.5 (x64), and Linux is running in a virtual machine on Windows 7 (x64).
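
To check which URLs actually ended up in the crawl storage after these commands, something like the following might help (a sketch only; the readdb command and its -stats/-dump options are assumed from Nutch 2.x's WebTableReader, so verify the exact usage with bin/nutch readdb on your build):

    # Sketch: inspect the stored web table (options assumed from Nutch 2.x, verify locally).
    bin/nutch readdb -stats          # per-status counts of the stored URLs
    bin/nutch readdb -dump webdump   # dump the entries into the webdump/ directory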

1 vote

This post is a bit outdated but still valid for the most part: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/

Perhaps the last crawled pages are the ones that change the most. Nutch uses an adaptive algorithm to schedule re-crawls, so when a page is very static it should not be re-crawled very often. You can override how often you want to re-crawl using nutch-site.xml. Also, the seed.txt file is meant to be just a seed list; once you inject the URLs, Nutch does not use it anymore (unless you manually re-inject them).
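
For instance, a minimal sketch of such an override in conf/nutch-site.xml could look like this (the one-day value is only an illustration, not a recommendation):

    <!-- Sketch only: force a shorter default re-fetch interval. -->
    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
      <description>Default number of seconds between re-fetches of a page (here: 1 day).</description>
    </property>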

Another configuration that may help is your regex-urlfilter.txt, if you want to point the crawl at a specific place or exclude certain domains/pages, etc.
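
As a rough sketch of those rules (example.com is a placeholder, not a domain from the question): lines starting with - reject matching URLs, lines starting with + accept them, and the first rule that matches wins:

    # Sketch only: example.com is a placeholder domain.
    # Reject a section you do not want crawled
    -^http://www\.example\.com/private/
    # Accept everything else on the domain
    +^http://([a-z0-9-]+\.)*example\.com/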

Cheers.

0 votes

Just add the property tag below to your nutch-site.xml. It works for me; check it out:

    <property>
      <name>file.crawl.parent</name>
      <value>false</value>
    </property>

Then change regex-urlfilter.txt:

    # skip file: ftp: and mailto: urls
    #-^(file|ftp|mailto):

    # accept anything else
    +.

After that, remove the indexing directory, either manually or with a command such as:

    rm -r $NUTCH_HOME/indexdir

Then run your crawl command again.
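
Putting the steps together as a rough shell sketch (it reuses the inject/crawl commands and the indexdir path already shown in this thread; adjust names and paths to your own setup):

    # Sketch: drop the old index, then re-run the crawl with the new seeds.
    rm -r $NUTCH_HOME/indexdir            # remove the old index directory
    cd $NUTCH_HOME
    bin/nutch inject urls                 # inject the new URLs from urls/seed.txt
    bin/nutch crawl urls -depth 3         # re-crawl, same command as used earlier in this thread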