I have integrated Nutch 1.13 with Solr 6.5.1 on an EC2 instance. I copied schema.xml to Solr using the cp command below. I have set localhost as elastic.host in nutch-site.xml in the nutch_home/conf folder.
cp /usr/local/apache-nutch-1.13/conf/schema.xml /usr/local/apache-nutch-1.13/solr-6.5.1/server/solr/nutch/conf/
Also, a managed-schema file is created each time, since it's Solr 6. Everything up to indexing works fine. The command I tried is:
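For context, the Solr-side setup was roughly the following (a sketch; the paths and the core name nutch are as in the cp command above):

# start Solr and create a core for Nutch
cd /usr/local/apache-nutch-1.13/solr-6.5.1
bin/solr start
bin/solr create -c nutch

# replace the generated schema with Nutch's schema.xml;
# on restart, Solr 6 rebuilds managed-schema from schema.xml
cp /usr/local/apache-nutch-1.13/conf/schema.xml server/solr/nutch/conf/
rm server/solr/nutch/conf/managed-schema
bin/solr restart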
[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/#/nutch/ urls/ crawl 1
Everything runs fine until the indexing step of the above command; I'm totally stuck at this last step.
Error running: /usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=http://35.160.82.191:8983/solr/#/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519074733
Failed with exit value 255.
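The #/nutch/ part is the URL fragment used by the Solr admin UI in the browser, not an actual HTTP endpoint, so the indexer presumably cannot reach the core through it. The plain core URL should work instead (same host and core name as above):

bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 1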
Thanks in advance
UPDATE I changed the property below in conf/nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Now there is no error, but I get the following:
Deduplication finished at 2017-05-19 10:08:05, elapsed: 00:00:03
Indexing 20170519100420 to index
/usr/local/apache-nutch-1.13/bin/nutch index -Dsolr.server.url=http://35.160.82.191:8983/solr/nutch/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20170519100420
Segment dir is complete: crawl/segments/20170519100420.
Indexer: starting at 2017-05-19 10:08:06
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No IndexWriters activated - check your configuration
Indexer: number of documents indexed, deleted, or skipped:
Indexer: 44 indexed (add/update)
Indexer: finished at 2017-05-19 10:08:10, elapsed: 00:00:03
Cleaning up index if possible
/usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=http://35.160.82.191:8983/solr/nutch/ crawl/crawldb
Fri May 19 10:08:13 UTC 2017 : Finished loop with 1 iterations
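The "No IndexWriters activated - check your configuration" line means Nutch loaded no indexing back end: the plugin.includes value above does not list the indexer-solr plugin, so the 44 documents were never sent to Solr. A value that keeps it would look roughly like this (a sketch based on the Nutch 1.13 default):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>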
UPDATE 2 I found that adding indexer-solr in nutch-site.xml helped, as suggested in this post, but now the error is in the cleaning part:
Error running: /usr/local/apache-nutch-1.13/bin/nutch clean -Dsolr.server.url=http://35.160.82.191:8983/solr/nutch/ crawl/crawldb
Failed with exit value 255.
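To rule out a connectivity or core-name problem, the core can be checked directly from the instance (a hypothetical sanity check; adjust host and core name as needed):

curl 'http://35.160.82.191:8983/solr/nutch/admin/ping?wt=json'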
Any suggestions? I want to implement a search engine using Solr.

UPDATE 3
Now there is no error at all, but fetching is not working for some reason. Only the URLs specified in urls/seed.txt are fetched and crawled; no external links are followed by Nutch.
[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 5
Injecting seed URLs
/usr/local/apache-nutch-1.13/bin/nutch inject crawl/crawldb urls/
Injector: starting at 2017-05-19 12:27:19
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 1
Injector: Total new urls injected: 0
Injector: finished at 2017-05-19 12:27:21, elapsed: 00:00:02
Fri May 19 12:27:21 UTC 2017 : Iteration 1 of 5
Generating a new segment
/usr/local/apache-nutch-1.13/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2017-05-19 12:27:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
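Two things seem to be going on here, as far as I can tell. First, "0 records selected for fetching" is expected for the seed itself: it was already fetched in an earlier run and won't be due again until db.fetch.interval.default (30 days by default) elapses. Second, whether outlinks to other hosts enter the CrawlDb is controlled by db.ignore.external.links and by the patterns in conf/regex-urlfilter.txt; to follow external links, the property in nutch-site.xml would be:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>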
I want to use the Nutch data for web search results from Solr.

FINAL UPDATE
[ec2-user@ip-172-31-16-137 apache-nutch-1.13]$ bin/crawl -i -D solr.server.url=http://35.160.82.191:8983/solr/nutch/ urls/ crawl 1
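With that in place, the indexed documents can be checked straight from Solr before wiring up the search front end (a hypothetical query; *:* matches all documents):

curl 'http://35.160.82.191:8983/solr/nutch/select?q=*:*&rows=5&wt=json'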