
I am trying to crawl an entire, specific website (ignoring external links) using Nutch 2.3 with HBase 0.94.14.

I have followed a step-by-step tutorial (you can find it here) on how to set up and use these tools. However, I haven't been able to achieve my goal. Instead of crawling the entire website whose URL I've written in the seed.txt file, Nutch only retrieves that base URL in the first round, and I need to run further crawl rounds for Nutch to retrieve more URLs.

The problem is that I don't know how many rounds I need in order to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, to "crawl the entire website in a single round").
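The only workaround I can think of so far is to repeat the per-round commands in a loop with a guessed number of iterations, along these lines (just a sketch; the 10 rounds are an arbitrary guess and do not guarantee full coverage):

    # Workaround sketch: run a guessed, fixed number of crawl rounds
    # after an initial "bin/nutch inject urls". 10 is arbitrary; it does
    # not guarantee that the whole site gets crawled.
    for i in $(seq 1 10); do
        bin/nutch generate -topN 50000
        bin/nutch fetch -all
        bin/nutch parse -all
        bin/nutch updatedb -all
    done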

Here are the key steps and settings I have followed so far:

  1. Put the base URL in the seed.txt file.

    http://www.whads.com/


  2. Set up Nutch's nutch-site.xml configuration file. After finishing the tutorial, I added a few more properties following suggestions from other StackOverflow questions (none of which, however, seems to have solved the problem for me).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                <name>http.agent.name</name>
                <value>test-crawler</value>
            </property>
            <property>
                <name>storage.data.store.class</name>
                <value>org.apache.gora.hbase.store.HBaseStore</value>
            </property>
            <property>
                <name>plugin.includes</name>
                <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
            </property>
            <property>
                <name>db.ignore.external.links</name>
                <value>true</value>
            </property>
            <property>
                <name>db.ignore.internal.links</name>
                <value>false</value>
            </property>
            <property>
                <name>fetcher.max.crawl.delay</name>
                <value>-1</value>
            </property>
            <property>
                <name>fetcher.threads.per.queue</name>
                <value>50</value>
                <description></description>
            </property>
            <property> 
                <name>generate.count.mode</name> 
                <value>host</value>
            </property>
            <property> 
                <name>generate.max.count</name> 
                <value>-1</value>
            </property>
    </configuration>
    

  1. Added "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file, following suggestions on StackOverflow and Nutch's mailing list.

    # Already tried these two filters (one at a time, 
    # and each one combined with the 'anything else' one)
    #+^http://www.whads.com
    #+^http://([a-z0-9]*.)*whads.com/
    
    # accept anything else
    +.
    

  4. Crawling: I have tried using two different approaches (both yielding the same result, with only one URL generated and fetched on the first round):

    • Using bin/nutch (following the tutorial):

      bin/nutch inject urls
      bin/nutch generate -topN 50000
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb -all
      
    • Using bin/crawl:

      bin/crawl urls whads 1
      

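For completeness, this is the kind of domain-restricted rule I could try in regex-urlfilter.txt instead of the catch-all "+." rule (dots escaped this time; just a sketch I have not verified on my setup):

    # Sketch (unverified): restrict the crawl to whads.com and its
    # subdomains, with the dots escaped, instead of accepting everything.
    +^http://([a-z0-9-]+\.)*whads\.com/
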
Am I still missing something? Am I doing something wrong? Or is it that Nutch can't crawl an entire website in one go?

Thank you so much in advance!

Nutch crawls the seed URLs and collects the in-links and out-links from them, then adds those links to the CrawlDB for the next crawl. I think that's why Nutch didn't crawl all pages in a single run. – helpdoc
Outdated: Nutch 2.3 no longer has the "depth" parameter (in fact, bin/nutch crawl is completely deprecated, and bin/crawl is used instead). That's why I gave the exact version in the question. Thank you for taking the time to answer anyway! – Gabriel Rodríguez

3 Answers


Please update your configuration as follows:

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
    </property>

At the moment you are ignoring external links, i.e. telling Nutch not to crawl external URLs.


After playing around with Nutch for a few more days and trying everything I found on the Internet, I ended up giving up. Some people say it is no longer possible to crawl an entire website in one go with Nutch. So, in case anyone having the same problem stumbles upon this question, do what I did: drop Nutch and use something like Scrapy (Python). You need to set up the spiders manually, but it works like a charm, is far more extensible and faster, and the results are better.


Did you try using -1 at the end? I can see you are using 1 at the end, which runs the crawl for only one round.
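For example, the last argument of bin/crawl is the number of rounds, so you could raise it; I have not checked whether -1 is actually accepted as "unlimited" on 2.3, so the 50 below is only an illustration:

    # Run up to 50 crawl rounds instead of 1 (50 is an arbitrary choice).
    bin/crawl urls whads 50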