
I am trying to crawl an entire, specific website (ignoring external links) using Nutch 2.3 with HBase 0.94.14.

I have followed a step-by-step tutorial (you can find it here) on how to set up and use these tools. However, I haven't been able to achieve my goal. Instead of crawling the entire website whose URL I've written in the seed.txt file, Nutch only retrieves that base URL in the first round, and I need to run further crawl rounds for Nutch to retrieve more URLs.

The problem is that I don't know how many rounds I need in order to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, to "crawl the entire website in a single round").
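The only workaround I can think of so far is to repeat the per-round commands in a loop with a guessed number of iterations, along these lines (just a sketch; the 10 rounds are an arbitrary guess and do not guarantee full coverage):

    # Workaround sketch: run a guessed, fixed number of crawl rounds
    # after an initial "bin/nutch inject urls". 10 is arbitrary; it does
    # not guarantee that the whole site gets crawled.
    for i in $(seq 1 10); do
        bin/nutch generate -topN 50000
        bin/nutch fetch -all
        bin/nutch parse -all
        bin/nutch updatedb -all
    done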

Here are the key steps and settings I have followed so far:

  1. Put the base URL in the seed.txt file.

    http://www.whads.com/


  2. Set up Nutch's nutch-site.xml configuration file. After finishing the tutorial, I added a few more properties following suggestions from other StackOverflow questions (none of which, however, seems to have solved the problem for me).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
            <property>
                <name>http.agent.name</name>
                <value>test-crawler</value>
            </property>
            <property>
                <name>storage.data.store.class</name>
                <value>org.apache.gora.hbase.store.HBaseStore</value>
            </property>
            <property>
                <name>plugin.includes</name>
                <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
            </property>
            <property>
                <name>db.ignore.external.links</name>
                <value>true</value>
            </property>
            <property>
                <name>db.ignore.internal.links</name>
                <value>false</value>
            </property>
            <property>
                <name>fetcher.max.crawl.delay</name>
                <value>-1</value>
            </property>
            <property>
                <name>fetcher.threads.per.queue</name>
                <value>50</value>
                <description></description>
            </property>
            <property> 
                <name>generate.count.mode</name> 
                <value>host</value>
            </property>
            <property> 
                <name>generate.max.count</name> 
                <value>-1</value>
            </property>
    </configuration>
    

  1. Added "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file, following suggestions on StackOverflow and Nutch's mailing list.

    # Already tried these two filters (one at a time, 
    # and each one combined with the 'anything else' one)
    #+^http://www.whads.com
    #+^http://([a-z0-9]*.)*whads.com/
    
    # accept anything else
    +.
    

  4. Crawling: I have tried using two different approaches (both yielding the same result, with only one URL generated and fetched on the first round):

    • Using bin/nutch (following the tutorial):

      bin/nutch inject urls
      bin/nutch generate -topN 50000
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb -all
      
    • Using bin/crawl:

      bin/crawl urls whads 1
      

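For completeness, this is the kind of domain-restricted rule I could try in regex-urlfilter.txt instead of the catch-all "+." rule (dots escaped this time; just a sketch I have not verified on my setup):

    # Sketch (unverified): restrict the crawl to whads.com and its
    # subdomains, with the dots escaped, instead of accepting everything.
    +^http://([a-z0-9-]+\.)*whads\.com/
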
Am I still missing something? Am I doing something wrong? Or is it that Nutch can't crawl an entire website in one go?

Thank you so much in advance!

Nutch crawls the seed URLs and collects the in-links and out-links from them, then adds those links to the CrawlDB for the next crawl. I think that's why Nutch didn't crawl all pages in a single run. – helpdoc
Outdated: Nutch 2.3 no longer has the "depth" parameter (in fact, bin/nutch crawl is completely deprecated, and bin/crawl is used instead). That's why I gave the exact version in the question. Thank you for taking the time to answer anyway! – Gabriel Rodríguez

3 Answers


Please update your configuration as follows:

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
    </property>

At the moment you are ignoring external links, i.e. telling Nutch not to crawl external URLs.


After playing around with Nutch for a few more days and trying everything I found on the Internet, I ended up giving up. Some people say it is no longer possible to crawl an entire website in one go with Nutch. So, in case anyone having the same problem stumbles upon this question, do what I did: drop Nutch and use something like Scrapy (Python). You need to set up the spiders manually, but it works like a charm, is far more extensible and faster, and the results are better.


Did you try using -1 at the end? I can see you are using 1 at the end, which runs the crawl for only one round.
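For example, the last argument of bin/crawl is the number of rounds, so you could raise it; I have not checked whether -1 is actually accepted as "unlimited" on 2.3, so the 50 below is only an illustration:

    # Run up to 50 crawl rounds instead of 1 (50 is an arbitrary choice).
    bin/crawl urls whads 50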