3
votes

I am using apache nutch 2.3 (latest version). I have crawled about 49000 documnts by nutch. From documents mime analysis, crawled data containes about 45000 thouseand text/html documents. But when I saw indexed documents in solr (4.10.3), only about 14000 documents are indexed. Why this huge difference between documents are (45000-14000=31000). If I assume that nutch only index text/html documents, then atleast 45000 documents should be indexed.

What is the problem. How to solve it?

2

2 Answers

0
votes

You should look at your Solr logs, to see if there's anything about "duplicate" documents, or just go look in the solrconfig.xml file for the core into which you are pushing the documents. There is likely a "dedupe" call is being made on the update handler, the fields used may be causing duplicate documents (based on a few fields) to be dropped. You'll see something like this

<requestHandler name="/dataimport"
        class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="update.chain">dedupe</str>     <<-- change dedupe to uuid
        <str name="config">dih-config.xml</str>        or comment the line
    </lst>
</requestHandler>

and later in the file the definition of the dedupe update.chain,

<updateRequestProcessorChain name="dedupe">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
         <bool name="enabled">true</bool>
         <str name="signatureField">id</str>
         <bool name="overwriteDupes">true</bool>
-->>     <str name="fields">url,date,rawline</str>     <<--
         <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The "fields" element is what will select which input data is used to determine the uniqueness of the record. Of course, if you know there's no duplication in your input data, this is not the issue. But the above configuration will throw out any records which are duplicate on the fields shown.

You may not be using the dataimport requestHandler, but rather the "update" requestHandler. I'm not sure which one Nutch uses. Either, way, you can simply comment out the update.chain, change it to a different processorChain such as "uuid", or add more fields to the "fields" declaration.

-1
votes

In my case this problem was due to missing solr indexer infomration in nutch-site.xml. When I update config, this problem was resolved. Please check your crawler log at indexing step. In my case it was informed that no solr indexer plugin is found.

Following lines (property) are added in nutch-site.xml

<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>plugin details here </description>
</property>