
I have to crawl as many inlinks as possible for a few URLs. For that I am using Apache Nutch 2.3.1 with Hadoop and HBase. The following nutch-site.xml file is used for this purpose.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
   <name>http.agent.name</name>
   <value>crawler</value>
</property>
<property>
   <name>storage.data.store.class</name>
   <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more|urdu)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>crawler,*</value>
</property>

<!-- language-identifier plugin properties -->

<property>
  <name>lang.ngram.min.length</name>
  <value>1</value>
</property>

<property>
  <name>lang.ngram.max.length</name>
  <value>4</value>
</property>

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
</property>

<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
</property>

<property>
  <name>lang.identification.only.certain</name>
  <value>true</value>
</property>

<!-- Language properties end here -->
<property> 
         <name>http.timeout</name> 
         <value>20000</value> 
</property> 
<!-- These properties were added because the number of crawled documents started to decrease -->
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>10</value>
</property>
<property>
  <name>generate.max.count</name>
  <value>10000</value>
</property>

<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>
</configuration>

When I crawl a few URLs, only the seed URLs are fetched and then crawling ends with this message:

GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 20
GeneratorJob: finished at 2017-04-21 16:28:35, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1492774111-8887 containing 0 URLs
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

A similar problem is described here, but that is for version 1.1, and the solution given there does not work in my case.

Did you find a solution to this problem? – Abhay Pai
After injecting the seeds you need to follow the cycle Generate > Fetch > Parse > UpdateDb. A single crawl round cannot fetch all links; you have to repeat this cycle multiple times, as sketched below. – helpdoc
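A rough sketch of that loop using the Nutch 2.x command-line jobs (a sketch only, not the exact commands used here; urls/ is an assumed seed directory and my_crawl a placeholder crawl ID; the bundled bin/crawl script wraps the same steps):

# inject the seed list once
bin/nutch inject urls/ -crawlId my_crawl

# then repeat generate -> fetch -> parse -> updatedb for as many rounds as needed
for round in 1 2 3; do
    bin/nutch generate -topN 1000 -crawlId my_crawl
    bin/nutch fetch -all -crawlId my_crawl      # -all fetches every pending batch
    bin/nutch parse -all -crawlId my_crawl
    bin/nutch updatedb -all -crawlId my_crawl   # writes newly discovered outlinks back to the web table
done

Each updatedb step adds the outlinks discovered in that round, so the next generate step has new URLs to select.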

1 Answer


Check your conf/regex-urlfilter.txt to see whether its URL-filtering rules are blocking the intended outlinks. The file should end with a catch-all accept rule:

# accept anything else
+.
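
Rules above that catch-all line can silently drop outlinks. For example, the stock filter file ships with a rule along these lines, which rejects any URL containing query-string characters:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

If the pages you need link to URLs with ? or = in them, loosen or comment out that rule.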

Since you set db.ignore.external.links to true, Nutch won't follow outlinks that point to different hosts. You also need to check the db.ignore.internal.links property in your conf/nutch-default.xml and make sure it is false. Otherwise, there will be no outlinks left to generate.

<property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
</property>
<property>
    <name>db.ignore.external.links</name>
    <value>true</value>
</property>
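
Once both properties are in place, a quick way to check whether outlinks actually reach the web table (a sketch; my_crawl stands for whatever crawl ID you use with the other jobs) is the stats output of the table reader:

bin/nutch readdb -crawlId my_crawl -stats

If the total URL count stays at the number of seeds after an updatedb round, the outlinks are being dropped before they reach the table.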

HTH.