I have integrated Nutch 2.3.1 with Solr 6.5, and with this setup I can push data to Solr and have it indexed. Now I want to remove duplicate documents, so I made the following modifications to schema.xml and solrconfig.xml:

<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />



<updateRequestProcessorChain name="dedupe">
   <processor class="solr.processor.SignatureUpdateProcessorFactory">
     <bool name="enabled">true</bool>
     <str name="signatureField">id</str>
     <bool name="overwriteDupes">false</bool>
     <str name="fields">id,content,date,url</str>  <!-- changing to id <str name="fields">name,features,cat</str>-->
     <str name="signatureClass">solr.processor.Lookup3Signature</str>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>

But when I then run bin/nutch solrindex http://localhost:8983/solr/testcore -all, indexing fails with an error. Please help me sort out this issue.

Thank you in advance :)

So what is the error you're getting (include the whole error and any screenshot if available)? When does the error occur? - MatsLindh
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=apache-nutch-2.3.1.jar, jobid=job_local1823407340_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
- VKM
This is the error I'm getting. It happens only when I add the update request handler for dedupe. - VKM
Add the error log to your question, and include the error log from Solr as well - that'll be helpful in finding out what the issue is. - MatsLindh

1 Answer

This issue might be related to the schema update. If you already had data in Solr and changed the schema while that data was still in the core, Nutch may treat it as a schema mismatch. The best way to fix this is to re-crawl the pages with the updated schema in place. Keep in mind that any change to the schema could cause issues with your existing index.
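
If the old documents are what conflicts with the new schema, one option is to empty the core before re-crawling and re-indexing. This is only a minimal sketch, and it assumes the core is the testcore from the question and that it is acceptable to discard everything currently indexed. The message below is a standard Solr XML delete-by-query; it would be posted with Content-Type text/xml to http://localhost:8983/solr/testcore/update?commit=true (the commit=true parameter makes the deletion visible right away).

```xml
<!-- Assumption: every document in the core may be discarded. -->
<!-- The query *:* matches all documents, so this empties the core. -->
<delete>
  <query>*:*</query>
</delete>
```

After the core is empty, running the Nutch indexing job again should populate it using only the updated schema.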

Since this post is already old, I'm leaving this for future reference for people who might run into the same issue.

Best :)