
I have integrated Nutch 1.13 with Solr 6.5.1 on an EC2 instance. I copied schema.xml to Solr using the cp command below. I have set localhost as elastic.host in nutch-site.xml in the nutch_home/conf folder.

cp /usr/local/apache-nutch-1.13/conf/schema.xml /usr/local/apache-nutch-1.13/solr-6.5.1/server/solr/nutch/conf/
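For reference, after copying the file the core usually has to be reloaded (or Solr restarted) for the new schema to take effect. A minimal sketch, assuming the core really is named nutch and Solr listens on the default port 8983 locally (RELOAD is the standard Core Admin action):

curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nutch"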

Also, since this is Solr 6, a managed-schema file is generated each time. Everything up to indexing works fine. The command I tried is:

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/#/nutch/ urls/ crawl 1

Everything runs fine up to the indexing step while running the above command; I'm completely stuck at this last step.

Error running: /usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=://35.160.82.191:8983/solr/#/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519074733 Failed with exit value 255.
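As a quick sanity check that the core itself is reachable (an assumption on my part that the core is really named nutch; /select is a standard Solr handler), something like:

curl "http://35.160.82.191:8983/solr/nutch/select?q=*:*&rows=0"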

Thanks in advance

UPDATE: I changed the property below in conf/nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Now there is no error, but I get the following:

Deduplication finished at 2017-05-19 10:08:05, elapsed: 00:00:03
Indexing 20170519100420 to index
/usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=//35.160.82.191:8983/solr/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519100420
Segment dir is complete: crawl/segments/20170519100420.
Indexer: starting at 2017-05-19 10:08:06
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 44 indexed (add/update)
Indexer: finished at 2017-05-19 10:08:10, elapsed: 00:00:03
Cleaning up index if possible
/usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=//35.160.82.191:8983/solr/nutch/ crawl/crawldb
Fri May 19 10:08:13 UTC 2017 : Finished loop with 1 iterations
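My reading of "No IndexWriters activated" is that the plugin.includes value above doesn't list any indexer-* plugin (index-basic and index-anchor are indexing filters, not index writers). A sketch of the same property with the Solr writer added, assuming the Nutch 1.13 plugin name indexer-solr:

<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>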

UPDATE 2: I found that adding the Solr indexer plugin (indexer-solr) to nutch-site.xml helps, as suggested in this post, but now the error is in the cleaning step:

Error running: /usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=://35.160.82.191:8983/solr/nutch/ crawl/crawldb Failed with exit value 255.

Any suggestions? I want to implement a search engine using Solr.

UPDATE 3

Now there is no error at all, but fetching is not working for some reason. Only the URLs specified in urls/seed.txt are fetched and crawled; no external links are followed by Nutch.

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 5
Injecting seed URLs
/usr/local/apache-nutch-1.13/bin/nutch inject crawl/crawldb urls/
Injector: starting at 2017-05-19 12:27:19
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-05-19 12:27:21, elapsed: 00:00:02
Fri May 19 12:27:21 UTC 2017 : Iteration 1 of 5
Generating a new segment
/usr/local/apache-nutch-1.13/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2017-05-19 12:27:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
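To see why the generator selects 0 records, one way is to inspect what the CrawlDb actually contains (readdb is a standard Nutch 1.x tool; crawl/crawldb is the path from the run above, and crawldb-dump is just a hypothetical output directory name):

bin/nutch readdb crawl/crawldb -stats

bin/nutch readdb crawl/crawldb -dump crawldb-dump

If the seed URL is there but none of its outlinks are, the urlfilter-regex rules and the db.ignore.external.links property in nutch-site.xml would be worth checking as well.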

I want to use the Nutch data for web search results from Solr.

FINAL UPDATE

[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=://35.160.82.191:8983/solr/nutch/ urls/ crawl  1 
Segment dir is complete: crawl/segments/20170519074733.
Indexer: starting at 2017-05-19 07:52:41
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length in bytes. (default 2500500)
elastic.exponential.backoff.millis : elastic bulk exponential backoff initial delay in milliseconds. (default 100)
elastic.exponential.backoff.retries : elastic bulk exponential backoff max retries. (default 10)
Error running: /usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=35.160.82.191:8983/solr/#/nutch crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519074733
Failed with exit value 255.
Cleaning up index if possible
/usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=://35.160.82.191:8983/solr/nutch/ crawl/crawldb
SolrIndexer: deleting 2/2 documents
ERROR CleaningJob: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)

1 Answer


nutch-site.xml doesn't need to be copied over to Solr, only the schema.xml file, which specifies the schema you want for the data coming from Nutch. If you're using Solr and not Elasticsearch, then the elastic.host parameter is not required.

Check the logs/hadoop.log file to see if there is more detail about the exception, and of course check the logs on the Solr side. This error usually means that something is wrong with the Solr configuration: missing fields, etc. In this case, if the schema.xml you copied isn't actually in effect and Nutch is not taking advantage of the managed schema on Solr 6, Solr is probably complaining about the missing fields.

Also, your Solr URL including the # character doesn't look right: that's how the Solr Admin UI shows the data in the browser, but to use it from Nutch or the terminal it should be /solr/nutch.
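Concretely, the two things I'd try first (the hadoop.log path assumes the default Nutch 1.13 layout, and the host and core name are the ones from your question):

tail -n 100 /usr/local/apache-nutch-1.13/logs/hadoop.log

bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 1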

BTW, check the tutorial; although some of the paths have changed in recent Solr versions, it is still a good guideline on how the integration works.