I am using StormCrawler 1.16 with Elasticsearch 7.2.0. The Java version is 1.8.0_252, the Storm version is 1.2.3, and the Maven version is 3.6.3.
I created the project using the Maven archetype -
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST
I made a seeds.txt file containing only 9 URLs for testing and submitted the topology in --remote mode using the command given in the README.md file.
It ran successfully and crawled the pages as intended.
But a problem arises when I put 8000 URLs in the seeds.txt file.
I ran the ES_IndexInit.sh file again and submitted the topology using the same command as before. Then I got this error -
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2020-06-12 11:26:11.416 c.d.s.e.p.AggregationSpout pool-12-thread-1 [ERROR] [spout #1] Exception with ES query
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_252]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_252]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148) [stormjar.jar:?]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) [stormjar.jar:?]
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) [stormjar.jar:?]
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) [stormjar.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
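The `Connection refused` in the trace above means the TCP connection from the spout to Elasticsearch was rejected. As a rough way to check whether Elasticsearch is reachable from the machine running the worker, here is a minimal sketch. The host `localhost` and port `9200` are assumptions (the Elasticsearch default HTTP port), not values taken from my configuration, and `can_connect` is just a helper name I made up for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed address: Elasticsearch on the same host, default port 9200.
    print("Elasticsearch reachable:", can_connect("localhost", 9200))
```

If this prints False on the worker node while Elasticsearch is running, the address the topology uses probably does not match the address Elasticsearch is actually bound to.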
Then I looked for errors in the worker.log file and found the same error there. Then I checked the health of my shards -
{
"cluster_name" : "my-cluster1",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 2,
"active_shards" : 2,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0,
"indices" : {
".kibana_task_manager" : {
"status" : "green",
"number_of_shards" : 1,
"number_of_replicas" : 0,
"active_primary_shards" : 1,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"shards" : {
"0" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}
}
},
".kibana_1" : {
"status" : "green",
"number_of_shards" : 1,
"number_of_replicas" : 0,
"active_primary_shards" : 1,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"shards" : {
"0" : {
"status" : "green",
"primary_active" : true,
"active_shards" : 1,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}
}
}
}
}
The shards' health is green. Now, if I submit the crawler topology from a different or new project, the topology remains idle and does not emit or transfer any tuples.
Am I using versions that are compatible with each other? Should I use Java 11 for Elasticsearch, or does it work fine with Java 8?
Details about the instance - I am using an EC2 medium instance running Ubuntu 18.04 with 4 GB of memory.
Could someone please explain what the issue is?