I'm running an Apache Ignite.NET 2.7 cluster on Linux inside a Kubernetes cluster. The Ignite cluster consists of 5 Ignite nodes running three microservices (two instances of the first service, two of the second, and one of the third). Two of the microservices deploy a couple of Ignite services which call each other.
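For reference, the Ignite services are deployed and consumed roughly along these lines (a simplified sketch; IProductService and the deployment mode are illustrative placeholders, only the "ProductService" name matches the error below):

using Apache.Ignite.Core.Services;

// Service contract shared between the microservices (placeholder).
public interface IProductService
{
    string GetProductName(int id);
}

// Ignite service implementation deployed by one of the microservices.
public class ProductService : IService, IProductService
{
    public void Init(IServiceContext context) { }
    public void Execute(IServiceContext context) { }
    public void Cancel(IServiceContext context) { }

    public string GetProductName(int id) => "product-" + id;
}

// At startup the hosting microservice deploys the service, e.g.:
// ignite.GetServices().DeployClusterSingleton("ProductService", new ProductService());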
The cluster starts up successfully, discovery works fine, and all the nodes join the cluster. But all of a sudden, both instances of one of the services (2 nodes) fail with the following error:
java.lang.IllegalStateException: Getting affinity for topology version earlier than affinity is calculated [locNode=TcpDiscoveryNode [id=76308a3b-221a-4307-b181-bd4e66d82683, addrs=[10.0.0.62, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, product-service-deployment-7dd5496d58-l426m/10.0.0.62:47500], discPort=47500, order=8, intOrder=6, lastExchangeTime=1560283011887, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false], grp=ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=17, minorTopVer=0], head=AffinityTopologyVersion [topVer=18, minorTopVer=0], history=[AffinityTopologyVersion [topVer=9, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=0], AffinityTopologyVersion [topVer=11, minorTopVer=1], AffinityTopologyVersion [topVer=12, minorTopVer=0], AffinityTopologyVersion [topVer=14, minorTopVer=0], AffinityTopologyVersion [topVer=16, minorTopVer=0], AffinityTopologyVersion [topVer=18, minorTopVer=0]]]
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:712)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:612)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodesByPartition(GridCacheAffinityManager.java:226)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByPartition(GridCacheAffinityManager.java:266)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:257)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primaryByKey(GridCacheAffinityManager.java:281)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$TopologyListener$1.run0(GridServiceProcessor.java:1877)
at org.apache.ignite.internal.processors.service.GridServiceProcessor$DepRunnable.run(GridServiceProcessor.java:2064)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This causes the other service to fail because it depends on the first service:
Unhandled Exception: Apache.Ignite.Core.Services.ServiceInvocationException: Proxy method invocation failed with an exception. Examine InnerException for details. ---> Apache.Ignite.Core.Common.IgniteException: Failed to find deployed service: ProductService ---> Apache.Ignite.Core.Common.JavaException: class org.apache.ignite.IgniteException: Failed to find deployed service: ProductService
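The failing call on the consumer side is essentially a service-proxy invocation, roughly like this (again a simplified sketch, not the literal code):

// The dependent microservice resolves the remote service through a proxy and calls it;
// this is where the ServiceInvocationException above surfaces.
var proxy = ignite.GetServices().GetServiceProxy<IProductService>("ProductService");
var name = proxy.GetProductName(42); // fails with "Failed to find deployed service: ProductService"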
Since the second service is being restarted by Kubernetes, the first service reports constant topology changes:
[19:57:14] Topology snapshot [ver=20, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:15] Topology snapshot [ver=21, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:17] Topology snapshot [ver=22, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:49] Topology snapshot [ver=23, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:50] Topology snapshot [ver=24, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:57:56] Topology snapshot [ver=25, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
[19:57:58] Topology snapshot [ver=26, locNode=76308a3b, servers=4, clients=0, state=ACTIVE, CPUs=4, offheap=6.2GB, heap=2.0GB]
[19:58:41] Topology snapshot [ver=27, locNode=76308a3b, servers=5, clients=0, state=ACTIVE, CPUs=5, offheap=7.8GB, heap=2.5GB]
Immediately before I identified this problem, I ran a minor reconfiguration of the Kubernetes cluster that did not cause any pod restarts. I'm not sure whether it could be the cause of the condition in question.
Is this a known problem with a solution? What should I check (particularly in the logs) that could shed light on this situation?
Thank you!