
We built an Akka Cluster infrastructure for SMS, email and push notifications. The system has three kinds of nodes: client, sender and lighthouse. The client role is used by our Web application and our API application (both hosted in IIS). The lighthouse and sender roles are hosted as Windows services. Because IIS recycles the AppPools of the Web and API applications, we shut down the actor system in the client roles in the Stop event in global.asax.cs and start it again in the Start event. The logs show that the system normally shuts down successfully and rejoins the cluster.
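For illustration, the start/stop lifecycle described above could look roughly like the sketch below. This is not the actual application code: the class layout, the 10-second timeouts and the explicit `Leave` call before termination are assumptions, and `RegisterOnMemberRemoved` and `Terminate` assume Akka.NET 1.1 or later.

```csharp
using System;
using System.Threading;
using System.Web;
using Akka.Actor;

public class Global : HttpApplication
{
    // Hypothetical static holder for the client ActorSystem.
    private static ActorSystem _system;

    protected void Application_Start(object sender, EventArgs e)
    {
        // Loads the <akka><hocon> section from Web.config.
        _system = ActorSystem.Create("NotificationSystem");
    }

    protected void Application_End(object sender, EventArgs e)
    {
        if (_system == null) return;

        var cluster = Akka.Cluster.Cluster.Get(_system);
        var removed = new ManualResetEventSlim();

        // Signal once the cluster has removed this member.
        cluster.RegisterOnMemberRemoved(() => removed.Set());

        // Assumption: leave the cluster explicitly before terminating,
        // so the other nodes do not have to detect the node as unreachable.
        cluster.Leave(cluster.SelfAddress);

        // Timeouts are arbitrary; an AppPool recycle only allows limited time.
        removed.Wait(TimeSpan.FromSeconds(10));
        _system.Terminate().Wait(TimeSpan.FromSeconds(10));
        _system = null;
    }
}
```

Waiting for the member-removed callback before terminating is one possible design choice; it trades a slightly slower recycle for a cleaner exit from the cluster.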

But sometimes, when the AppPool recycles, the client ActorSystem starts but can't join the cluster, and our notifications stop working (which is a huge problem for us). When we manually shut down the ActorSystem and start it again, it joins the cluster. This happens approximately every two days.

We can see that the client joins the cluster before the error:

Node [akka.tcp://NotificationSystem@...:41350] is JOINING, roles [client]
Leader is moving node [akka.tcp://NotificationSystem@...:41350] to [Up]

Looking at the logs, we see the following error after the client joins the cluster:

Shut down address: akka.tcp://NotificationSystem@...:41350
Akka.Remote.ShutDownAssociation: Shut down address: akka.tcp://NotificationSystem@...:41350 ---> Akka.Remote.Transport.InvalidAssociationException: The remote system terminated the association because it is shutting down.
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.b__20_0(Exception ex)
   at Akka.Actor.LocalOnlyDecider.Decide(Exception cause)
   at Akka.Actor.OneForOneStrategy.Handle(IActorRef child, Exception x)
   at Akka.Actor.SupervisorStrategy.HandleFailure(ActorCell actorCell, Exception cause, ChildRestartStats failedChildStats, IReadOnlyCollection`1 allChildren)
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
Akka.Remote.ShutDownAssociation: Shut down address: akka.tcp://NotificationSystem@...:41350 ---> Akka.Remote.Transport.InvalidAssociationException: The remote system terminated the association because it is shutting down.
   --- End of inner exception stack trace ---
   at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
   at Akka.Remote.EndpointWriter.b__20_0(Exception ex)
   at Akka.Actor.LocalOnlyDecider.Decide(Exception cause)
   at Akka.Actor.OneForOneStrategy.Handle(IActorRef child, Exception x)
   at Akka.Actor.SupervisorStrategy.HandleFailure(ActorCell actorCell, Exception cause, ChildRestartStats failedChildStats, IReadOnlyCollection`1 allChildren)
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Actor.ActorCell.HandleFailed(Failed f)
   at Akka.Actor.ActorCell.SystemInvoke(Envelope envelope)

After that error, we see the following message:

Association to [akka.tcp://NotificationSystem@...:41350] having UID [226948907] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.

The system doesn't recover on its own; it only works again after we restart the client ActorSystem.
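One way to notice this state from inside the client, instead of finding it in the logs later, might be to subscribe an actor to the remoting lifecycle event that Akka.Remote publishes when the local system learns it has been quarantined. The sketch below is an assumption-laden illustration: `QuarantineListener` is a hypothetical name, the event type `ThisActorSystemQuarantinedEvent` is assumed to be available in the Akka.NET version in use, and what to do in the handler (for example, triggering the manual restart described above) is left open.

```csharp
using Akka.Actor;
using Akka.Event;
using Akka.Remote;

// Hypothetical listener actor; not part of the original system.
public class QuarantineListener : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public QuarantineListener()
    {
        Receive<ThisActorSystemQuarantinedEvent>(evt =>
        {
            // Published on the quarantined system itself.
            _log.Warning("This ActorSystem was quarantined by {0}", evt.RemoteAddress);

            // Assumption: the application restarts the ActorSystem here,
            // since a quarantined system cannot rejoin with the same UID.
        });
    }

    protected override void PreStart()
    {
        Context.System.EventStream.Subscribe(Self, typeof(ThisActorSystemQuarantinedEvent));
    }

    protected override void PostStop()
    {
        Context.System.EventStream.Unsubscribe(Self);
    }
}
```

The listener would be created once at startup, for example with `system.ActorOf(Props.Create(() => new QuarantineListener()), "quarantine-listener")`, where the actor name is again only illustrative.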

Our client role configuration is:

<akka>
<hocon>
    <![CDATA[
        akka{
            loglevel = DEBUG

            actor{
                provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"

                deployment {
                    /coordinatorRouter {
                        router = round-robin-group
                        routees.paths = ["/user/NotificationCoordinator"]
                        cluster {
                                enabled = on
                                max-nr-of-instances-per-node = 1
                                allow-local-routees = off
                                use-role = sender
                        }
                    }                
                }

                serializers {
                    wire = "Akka.Serialization.WireSerializer, Akka.Serialization.Wire"
                }

                serialization-bindings {
                 "System.Object" = wire
                }

                debug{
                    receive = on
                    autoreceive = on
                    lifecycle = on
                    event-stream = on
                    unhandled = on
                }
            }

            remote {
                helios.tcp {
                        transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                        applied-adapters = []
                        transport-protocol = tcp
                        hostname = "***.***.**.**"
                        port = 0
                }
            }

            cluster {
                    seed-nodes = ["akka.tcp://NotificationSystem@***.***.**.**:5053", "akka.tcp://NotificationSystem@***.***.**.**:5073"]
                    roles = [client]
            }
        }
    ]]>
</hocon>
</akka>

Our sender role configuration is:

  <akka>
<hocon><![CDATA[
            akka{
                loglevel = INFO

                loggers = ["Akka.Logger.NLog.NLogLogger, Akka.Logger.NLog"]

                actor{
                    debug {  
                        # receive = on 
                        # autoreceive = on
                        # lifecycle = on
                        # event-stream = on
                        # unhandled = on
                    }         

                    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"           

                    serializers {
                        wire = "Akka.Serialization.WireSerializer, Akka.Serialization.Wire"
                    }

                    serialization-bindings {
                     "System.Object" = wire
                    }

                    deployment{
                        /NotificationCoordinator/ApplePushNotificationActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/AndroidPushNotificationActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/EmailActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/SmsActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }

                        /NotificationCoordinator/LoggingCoordinator/ResponseLoggerActor{
                            router = round-robin-pool
                            resizer{
                                enabled = on
                                lower-bound = 3
                                upper-bound = 5
                            }
                        }                           
                    }
                }

                remote{
                    log-remote-lifecycle-events = DEBUG
                    log-received-messages = on

                    helios.tcp{
                        transport-class = "Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote"
                        applied-adapters = []
                        transport-protocol = tcp
                        #will be populated with a dynamic host-name at runtime if left uncommented
                        #public-hostname = "POPULATE STATIC IP HERE"
                        hostname = "***.***.**.**"
                        port = 0
                    }
                }

                cluster {
                        seed-nodes = ["akka.tcp://NotificationSystem@***.***.**.**:5053", "akka.tcp://NotificationSystem@***.***.**.**:5073"]
                        roles = [sender]
                }
            }
        ]]></hocon>
</akka>

How can we solve this problem? Thank you.

1 Answer


This is definitely a bug in the EndpointManager in Akka.Remote. Akka.NET 1.1, due to be released on June 14th, should address it. We've fixed a ton of cluster rejoin bugs along these lines, but the fixes haven't been released just yet. Akka.Cluster will be RTM-ed as part of that release.

In the meantime, you could also try using the Akka.NET Nightly Builds if you want to try the new bits now.