0
votes

I am using the following packages:
- Apache ZooKeeper 3.4.14
- Apache Storm 1.2.3
- Apache Maven 3.6.2
- Elasticsearch 7.2.0 (hosted locally)
- Java 1.8.0_252
- AWS EC2 medium instance with 4 GB RAM

I have used this command to increase the virtual memory map count (earlier Elasticsearch was showing an error about the JVM not having enough virtual memory):
sysctl -w vm.max_map_count=262144
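As a side note, `sysctl -w` only applies until the next reboot. A common way to make the setting permanent (standard Linux sysctl mechanics, not anything Elasticsearch-specific) is a sketch like:

```shell
# Apply immediately (requires root)
sysctl -w vm.max_map_count=262144

# Persist across reboots by adding the setting to /etc/sysctl.conf
echo "vm.max_map_count=262144" >> /etc/sysctl.conf

# Verify the value currently in effect
sysctl vm.max_map_count
```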

I have created the Maven package with:
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

Command used for submitting the topology:
storm jar target/newscrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 30000

When I run this command, it shows my topology is submitted successfully, and the status index in Elasticsearch shows FETCH_ERROR along with the URL from seeds.txt.

The content index shows no hits in Elasticsearch.

In the worker.log file there were many exceptions of the following type:

java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) [stormjar.jar:?]
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) [stormjar.jar:?]
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) [stormjar.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

2020-06-12 10:31:14.635 c.d.s.e.p.AggregationSpout Thread-46-spout-executor[17 17] [INFO] [spout #7] Populating buffer with nextFetchDate <= 2020-06-12T10:30:50Z
2020-06-12 10:31:14.636 c.d.s.e.p.AggregationSpout Thread-32-spout-executor[19 19] [INFO] [spout #9] Populating buffer with nextFetchDate <= 2020-06-12T10:30:50Z
2020-06-12 10:31:14.636 c.d.s.e.p.AggregationSpout pool-13-thread-1 [ERROR] [spout #7] Exception with ES query

There are the following logs in worker.log related to Elasticsearch:

Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/status/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=_shards%3A1&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 503 Service Unavailable] {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"}],"type":"cluster_block_exception","reason":"blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];"},"status":503}

Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/status/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=_shards%3A8&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 503 Service Unavailable] {"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}

I have checked the health of the shards; they are in green status.
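For context, the 503 cluster_block_exception ("state not recovered / initialized") typically appears while Elasticsearch is still starting up. Assuming ES is on localhost:9200 as in the logs above, the cluster and index state can be checked with the standard health and count APIs:

```shell
# Cluster-level health: status should be green or yellow once the state is recovered
curl -s "http://localhost:9200/_cluster/health?pretty"

# Shard-level health for the status index used by the crawler
curl -s "http://localhost:9200/_cluster/health/status?level=shards&pretty"

# Confirm whether documents are actually being written
curl -s "http://localhost:9200/status/_count?pretty"
curl -s "http://localhost:9200/content/_count?pretty"
```

If the first command returns green/yellow but the workers still log ConnectException, the workers may simply have started before Elasticsearch finished recovering.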

Earlier I was using Java 11, with which I was not able to submit the topology, so I shifted to Java 8. Now the topology is submitted successfully, but no data is injected into Elasticsearch.

I want to know if there is a version incompatibility between Java and Elasticsearch, or with any other package.

1
4 GB RAM may not be enough to run ES and a Storm topology. At the least, it will require carefully configuring the Java heap size of each component. - Sebastian Nagel
I have tried with 16 GB of RAM. Stuck with the same problem. - P-DOX
"reason":"all shards failed" - there should be something in the Elasticsearch logs. - Sebastian Nagel

1 Answer

0
votes

Use an absolute path for the seed file and run the topology in remote mode; local mode should be used mostly for debugging. The sleep parameter is (I think) in milliseconds, so the command above means the topology will run for 30 seconds only, which doesn't give it much time to do anything.
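For example (reusing the jar and flux file names from the question; adjust them to your project), a remote submission would drop the --local and --sleep flags:

```shell
# Submit the topology to the Storm cluster instead of running it in-process
storm jar target/newscrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote es-crawler.flux
```

In remote mode the topology keeps running until you kill it, which gives the spouts time to fetch and index pages; the seed file path referenced in the flux configuration should then be absolute, since the workers will not share your shell's working directory.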