I'm trying to build a realtime (private) chat between users of a video game with 25K+ concurrent connections. We currently run 32 nodes where users can connect through a load balancer. The problem I'm trying to solve is how to route messages to each user?
Currently, we are using socket.io & socket.io-redis, where each websocket joins a room with its user ID, and we emit each message they should receive to that room. The problem with this design is that we are reaching the limits of Redis Pubsub, and Socket.io which doesn't scale well (socket.io emit messages to all nodes which check if the user is connected, this is not viable).
Our current stack is composed of Postgres, Redis & RabbitMQ. I have been thinking about this problem a lot and have come up with 3 different solutions :
- Route all messages with RabbitMQ. When a user connects, we create an exchange with type fanout with the user ID and a queue per websocket connection (we have to handle multiple connections per user). When we want to emit to that user, we simply publish to that exchange. The problem with that approach is that we have to create a lot of queues, and I heard that this may not be very efficient.
- Create a queue for each node in RabbitMQ. When a user connects, we save the node & socket ID in a Redis Set, so that when we have to send a message to that specific user, we first get the list of nodes, emit to each node queue, which then handle routing to specific client in the app. The problems with that approach is that in the case of a node failure, we may store that a user is connected when this is not the case. To fix that, we would need to expire the users's Redis entry but this is not a perfect fix. Also, if we later want to implement group chat, it would mean we have to send duplicates messages in Rabbit, this is not ideal.
- Go all in with Firebase Cloud Messaging. We have a mobile app, and we plan to use it for push notifications when the user isn't connected, but would it be a good fit even if the user is connected?
What do you think is the best fit for our use case? Do you have any other idea?