We are currently implementing retry support in gRPC, following this design:
https://github.com/grpc/proposal/blob/master/A6-client-retries.md
Unfortunately, the implementation is fairly complex and will take a while to complete, so it is not ready for use yet. In the interim, you will probably need to implement your own retry logic in your application.
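As a rough illustration of what application-level retry code might look like, here is a minimal sketch of retry with exponential backoff and jitter. It is not part of gRPC and is not the A6 design. The names callWithRetry, isRetryable, backoffDelayMs and the default values are all assumptions made for this example. It assumes the error object carries a numeric gRPC status in err.code, and it retries only UNAVAILABLE (14). Only retry calls that are safe to repeat.

```javascript
// gRPC status code for UNAVAILABLE, which usually signals a transient failure.
const UNAVAILABLE = 14;

// Assumption: only UNAVAILABLE is worth retrying; adjust the list for your service.
function isRetryable(err, retryableCodes = [UNAVAILABLE]) {
  return Boolean(err) && retryableCodes.includes(err.code);
}

// Exponential backoff: initialDelayMs * multiplier^attempt, capped at maxDelayMs.
// The default values are illustrative, not recommendations from gRPC.
function backoffDelayMs(
  attempt,
  { initialDelayMs = 100, multiplier = 2, maxDelayMs = 5000 } = {}
) {
  return Math.min(initialDelayMs * Math.pow(multiplier, attempt), maxDelayMs);
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// callOnce is a function you supply that performs one RPC attempt and
// returns a Promise. It is invoked again for each retry.
async function callWithRetry(callOnce, { maxAttempts = 4, ...backoff } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callOnce();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (lastAttempt || !isRetryable(err)) {
        throw err;
      }
      // Random jitter in [0, delay) so that many clients do not retry in lockstep.
      await sleep(Math.random() * backoffDelayMs(attempt, backoff));
    }
  }
}

// Usage sketch (client.sayHello and request are hypothetical):
//
// callWithRetry(() => new Promise((resolve, reject) => {
//   client.sayHello(request, (err, response) => {
//     if (err) reject(err); else resolve(response);
//   });
// })).then(console.log, console.error);
```

Wrapping each attempt in a function keeps the retry loop independent of any particular service method, so the same helper can be reused for every unary call you consider safe to retry.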
With regard to your environment, one common problem that people see when using gRPC on AWS is that DNS provides no way to proactively notify the client when the servers' IP addresses change after a restart. If that's the problem you're seeing, there are a couple of things you can try:
Try using the round_robin load-balancing policy, so that if you lose contact with one backend, you will still be able to talk to the others while the client attempts to regain contact with the one that went down. You can do this by passing the channel argument {"grpc.lb_policy_name": "round_robin"} in the third argument to a client constructor.
Alternatively, you could set up a look-aside load balancer that dynamically sends new IP addresses to the clients. For more information on this architecture, see https://github.com/grpc/grpc/blob/master/doc/load-balancing.md.
Good luck!