I am trying to crawl an entire, specific website (ignoring external links) using Nutch 2.3 with HBase 0.94.14.
I have followed a step-by-step tutorial (you can find it here) on how to set up and use these tools. However, I haven't been able to achieve my goal. Instead of crawling the entire website whose URL I've put in the seed.txt file, Nutch only retrieves that base URL in the first round, and I have to run further rounds for it to retrieve more URLs.
The problem is I don't know how many rounds I need in order to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, "crawl the entire website in a single round").
Here are the key steps and settings I have followed so far:
Put the base URL in the seed.txt file:

    http://www.whads.com/
Set up Nutch's nutch-site.xml configuration file. After finishing the tutorial, I added a few more properties following suggestions from other StackOverflow questions (none of which, however, has solved the problem for me):
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>http.agent.name</name> <value>test-crawler</value> </property> <property> <name>storage.data.store.class</name> <value>org.apache.gora.hbase.store.HBaseStore</value> </property> <property> <name>plugin.includes</name> <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value> </property> <property> <name>db.ignore.external.links</name> <value>true</value> </property> <property> <name>db.ignore.internal.links</name> <value>false</value> </property> <property> <name>fetcher.max.crawl.delay</name> <value>-1</value> </property> <property> <name>fetcher.threads.per.queue</name> <value>50</value> <description></description> </property> <property> <name>generate.count.mode</name> <value>host</value> </property> <property> <name>generate.max.count</name> <value>-1</value> </property> </configuration>
Added "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file, following suggestions on StackOverflow and Nutch's mailing list.
    # Already tried these two filters (one at a time,
    # and each one combined with the 'anything else' one)
    #+^http://www.whads.com
    #+^http://([a-z0-9]*.)*whads.com/

    # accept anything else
    +.
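Incidentally, I realize the dots in those commented-out patterns are unescaped, so a stricter version of the host filter would presumably look something like this (I haven't verified that it makes any difference here):

    # restrict to whads.com and its subdomains, with dots escaped
    +^http://([a-z0-9]+\.)*whads\.com/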
Crawling: I have tried using two different approaches (both yielding the same result, with only one URL generated and fetched on the first round):
Using bin/nutch (following the tutorial):

    bin/nutch inject urls
    bin/nutch generate -topN 50000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb -all
Using bin/crawl:

    bin/crawl urls whads 1
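With either approach, the only workaround I can think of is scripting an arbitrarily large number of rounds and hoping it covers the whole site, roughly like this (a sketch; the round count of 20 is a blind guess, which is exactly what I'd like to avoid):

    # brute-force workaround: repeat generate/fetch/parse/updatedb rounds
    for i in $(seq 1 20); do
        bin/nutch generate -topN 50000
        bin/nutch fetch -all
        bin/nutch parse -all
        bin/nutch updatedb -all
    done

As far as I can tell, the last argument to bin/crawl is the number of rounds, so bin/crawl urls whads 20 should be equivalent to the loop above, but that still means guessing a number.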
Am I still missing something? Am I doing something wrong? Or is it that Nutch can't crawl an entire website in one go?
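In case it helps with diagnosis, I assume the contents of the web table can be inspected between rounds with readdb, along these lines (based on my reading of the command usage; I'm passing -crawlId whads for the bin/crawl case, and the dump directory name is just an example):

    # summary of stored URLs by status
    bin/nutch readdb -stats -crawlId whads

    # dump the stored records for inspection
    bin/nutch readdb -dump ./webtable_dump -crawlId whads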
Thank you so much in advance!