I am building a search engine on the Apache Nutch and Elasticsearch stack, using Apache Nutch 2.1 and Elasticsearch 1.7.3.
I am trying to index directly from Nutch by following the instructions here: https://www.mind-it.info/2013/09/26/integrating-nutch-1-7-elasticsearch/. Both Nutch and Elasticsearch run on my localhost, and the cluster name is "elasticsearch".
Here is part of what I changed in nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
After rebuilding with ant runtime, I ran this command:
bin/nutch elasticindex elasticsearch -all
But it failed with this exception:
Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
    at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
    at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)
I'm not sure where I went wrong. Here is my hadoop.log:
2016-01-15 15:46:24,106 INFO elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:153)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:137)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Can anyone help me with this? Thanks!