I am currently making a search engine using the Apache Nutch and ElasticSearch stack. I am using Apache Nutch 2.1 and ElasticSearch 1.7.3.

I am trying to index directly from Nutch by following the instructions here: https://www.mind-it.info/2013/09/26/integrating-nutch-1-7-elasticsearch/. Both Nutch and Elasticsearch run on my localhost, and the cluster name is "elasticsearch".

These are some of the parts of nutch-site.xml that I changed:

<property>
    <name>plugin.includes</name>
    <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include.  Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with the
    underlying commons-httpclient library.
    </description>
</property>

After running the command ant runtime, I tried issuing the command

bin/nutch elasticindex elasticsearch -all

But it returned this:

Exception in thread "main" java.lang.RuntimeException: job failed: name=elastic-index [elasticsearch], jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:52)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.indexElastic(ElasticIndexerJob.java:60)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.run(ElasticIndexerJob.java:73)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.elastic.ElasticIndexerJob.main(ElasticIndexerJob.java:78)

I'm not sure where I went wrong. Here is my hadoop.log:

2016-01-15 15:46:24,106 INFO  elastic.ElasticIndexerJob - Starting
2016-01-15 15:46:24,733 INFO  plugin.PluginRepository - Plugins: looking in: /home/gabrielgagno/apache-nutch-2.1/runtime/local/plugins
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository - Registered Plugins:
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     the nutch core extension points (nutch-extensionpoints)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic URL Normalizer (urlnormalizer-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Basic Indexing Filter (index-basic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Html Parse Plug-in (parse-html)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Http / Https Protocol Plug-in (protocol-httpclient)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     HTTP Framework (lib-http)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter (urlfilter-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Pass-through URL Normalizer (urlnormalizer-pass)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Normalizer (urlnormalizer-regex)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Tika Parser Plug-in (parse-tika)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     OPIC Scoring Plug-in (scoring-opic)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     CyberNeko HTML Parser (lib-nekohtml)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Anchor Indexing Filter (index-anchor)
2016-01-15 15:46:24,817 INFO  plugin.PluginRepository -     Regex URL Filter Framework (lib-regex-filter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository - Registered Extension-Points:
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Protocol (org.apache.nutch.protocol.Protocol)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Parse Filter (org.apache.nutch.parse.ParseFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch URL Filter (org.apache.nutch.net.URLFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Content Parser (org.apache.nutch.parse.Parser)
2016-01-15 15:46:24,818 INFO  plugin.PluginRepository -     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2016-01-15 15:46:24,822 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:24,822 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:24,824 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:24,824 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:25,827 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-15 15:46:26,521 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] version[1.7.3], pid[18188], build[05d4530/2015-10-15T09:14:17Z]
2016-01-15 15:46:26,727 INFO  elasticsearch.node - [Layla Miller] initializing ...
2016-01-15 15:46:26,852 INFO  elasticsearch.plugins - [Layla Miller] loaded [], sites []
2016-01-15 15:46:28,229 WARN  elasticsearch.bootstrap - JNA not found. native methods will be disabled.
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] initialized
2016-01-15 15:46:28,756 INFO  elasticsearch.node - [Layla Miller] starting ...
2016-01-15 15:46:28,824 INFO  elasticsearch.transport - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/172.16.3.72:9301]}
2016-01-15 15:46:28,836 INFO  elasticsearch.discovery - [Layla Miller] elasticsearch/_tzxV-I7SSeduY9b8enpPw
2016-01-15 15:46:58,836 WARN  elasticsearch.discovery - [Layla Miller] waited for 30s and no initial state was set by the discovery
2016-01-15 15:46:58,845 INFO  elasticsearch.http - [Layla Miller] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/172.16.3.72:9201]}
2016-01-15 15:46:58,845 INFO  elasticsearch.node - [Layla Miller] started
2016-01-15 15:46:58,848 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2016-01-15 15:46:58,848 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-01-15 15:46:58,848 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-01-15 15:46:59,438 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 147, length = 1011442, total docs = 147]
2016-01-15 15:46:59,445 INFO  elastic.ElasticWriter - Processing to finalize last execute
2016-01-15 15:47:59,452 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2016-01-15 15:47:59,453 WARN  mapred.LocalJobRunner - job_local_0001
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:151)
    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:141)
    at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:215)
    at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:67)
    at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:153)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(TransportAction.java:137)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Can anyone help me with this? Thanks!

1 Answer

Make sure the Elasticsearch client library that Nutch depends on is the same version as the Elasticsearch server running on your machine.
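One way to compare the two versions is sketched below. It reads `version.number` from the JSON that the server returns at its root HTTP endpoint and compares it with the version embedded in the name of the Elasticsearch jar that Nutch ships. The default URL `http://localhost:9200/` and the example jar name are assumptions; check where your Nutch build actually puts the jar (possibly `runtime/local/lib`) and which port your server listens on.

```python
import json
import re
import urllib.request


def server_version(root_response: str) -> str:
    """Extract version.number from the JSON returned by GET / on Elasticsearch."""
    return json.loads(root_response)["version"]["number"]


def jar_version(jar_name: str) -> str:
    """Extract the version from a jar file name such as 'elasticsearch-1.7.3.jar'."""
    match = re.fullmatch(r"elasticsearch-(\d+(?:\.\d+)*)\.jar", jar_name)
    if match is None:
        raise ValueError("not an Elasticsearch jar: " + jar_name)
    return match.group(1)


def fetch_root(url: str = "http://localhost:9200/") -> str:
    """Fetch the root endpoint. The URL is an assumption (default HTTP port)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# Example use (needs a running server; the jar name is hypothetical):
#   same = server_version(fetch_root()) == jar_version("elasticsearch-1.7.3.jar")
```

If the two strings differ, the client bundled with Nutch and the server are on different versions.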

If they differ, don't waste time trying to make the Java API work. Push the documents from Nutch to Elasticsearch over HTTP instead.
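To illustrate the HTTP route, here is a minimal sketch of a request to the Elasticsearch `_bulk` endpoint. It is not a Nutch plugin, only an outline of the wire format: one action line followed by one source line per document, newline-separated, with a trailing newline. The index name `nutch`, the type name `doc`, the URL, and the sample document are assumptions made up for the example.

```python
import json
import urllib.request


def build_bulk_body(index: str, doc_type: str, docs: dict) -> bytes:
    """Build a newline-delimited _bulk payload from a {doc_id: source} mapping."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk API expects the payload to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def send_bulk(body: bytes, url: str = "http://localhost:9200/_bulk") -> dict:
    """POST the payload. The URL is an assumption (default HTTP port)."""
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example use (needs a running server; names are hypothetical):
#   body = build_bulk_body("nutch", "doc", {"http://example.com/": {"title": "Example"}})
#   result = send_bulk(body)
```

Because this goes through the REST interface, the sender does not have to join the cluster as a node, so it does not depend on a matching Java client version in the same way.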