
One of the features of Erlang (and, by extension, Elixir) is that you can do hot code swapping. However, this seems to be at odds with Docker, where you would need to stop your instances and start new ones from images holding the new code. This essentially seems to be what everyone does.

That being said, I also know that it is possible to use one hidden node to distribute updates to all other nodes over the network. Of course, put like that it sounds like asking for trouble, but...

My question would be the following: has anyone tried, with reasonable success, to set up a Docker-based infrastructure for Erlang/Elixir that allows hot code swapping? If so, what are the do's, don'ts and caveats?

Despite my comment, I believe this has not really been answered yet. More could be written about the different strategies for implementing HCS within Docker. My guess is that Docker or not does not matter: as long as you keep PROD aligned with HEAD, it won't matter whether a new container is pushed through Docker or through Erlang. But do keep things in sync! – Aki
Docker is not a particularly good fit for Erlang -- it is just the only way to deploy anything many younger shops know today. The sort of problems most people try to solve with Docker (lightweight parallelism, execution environment hygiene, "the network is the abstraction", process isolation, memory safety, etc.) are the problems Erlang already addresses in a more complete fashion. Hot code upgrades require a high degree of control and knowledge of the data types in motion and of the storage/code loading environment for the runtime. Docker prevents the second requirement in the normal case. – zxq9
@zxq9: This gives me more confidence in running Phoenix/Elixir apps outside of Docker, but does HCS actually make sense in a DevOps environment using CircleCI or TravisCI? Fundamentally, is Erlang HCS not orthogonal to Docker AND CI tools? This calls for a reassessment of the HCS feature: we know that PaaS providers often do not allow the nodes to interact with each other the way HCS needs in order to work well. This should be a whole new post: should Erlang web apps sacrifice HCS in favour of Docker and CI? – Aki
@Aki CI tools are another thing like Docker: they solve a problem Erlang has already solved in a different way. I suppose one could coerce Travis to integrate with HCS, but I don't know why this would be desirable. The exceptional case (which applies universally to Docker, ephemeral VM instances and CI tools) is using an external, remote code server for Erlang code. Meaning, all of your instances (or whatever) load code not from the local system, but on the fly from a remote code server as pre-compiled binaries (which is part of why *.ez compressed binaries were introduced). – zxq9
@Aki I don't have the time to do it now[1], but I would love to write a howto on the use of nearly-empty runtimes and remote code servers -- and maybe this would be the way to square the circle of youthful and naive DevOps staff who know nothing other than github-integrated CI tooling + Docker for deployment. ([1] Nobody is paying or will pay me to, and I'm already contributing all my unpaid time to creation of a redundant peer-distributed source package repo system + related tools for on-demand compile+run functionality. It probably applies more to client-side than backend, though.) – zxq9

1 Answer


The story

Imagine a system that handles mobile phone calls or mobile data access (that's what Erlang was created for). There are gateway servers that maintain the user session for the duration of the call or the data access session (I will call it the session going forward). Those servers keep an in-memory representation of the session for as long as the session is active (the user is connected).

Now there is another system that calculates how much to charge the user for the call or the data transferred (call it PDF - Policy Decision Function). The two systems are connected in such a way that the gateway server creates a handful of TCP connections to the PDF and drops user sessions if those TCP connections go down. The gateway can handle a few hundred thousand customers at a time. Whenever there is an event that the user needs to be charged for (the next data transfer, another minute of the call), the gateway notifies the PDF, and the PDF subtracts a specific amount of money from the user's account. When the account is empty, the PDF tells the gateway to disconnect the call (you've run out of money, you need to top up).
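The charging decision described above could be sketched roughly like this (the module, function and return values are my own invention for illustration, not part of any real PDF implementation):

```erlang
%% Hypothetical sketch of a PDF charging decision: subtract the cost
%% of a chargeable event from the user's balance and tell the gateway
%% whether to keep the session up or drop it.
-module(pdf_charging).
-export([charge/2]).

-spec charge(Balance :: non_neg_integer(), Cost :: pos_integer()) ->
          {continue, non_neg_integer()} | disconnect.
charge(Balance, Cost) when Balance >= Cost ->
    {continue, Balance - Cost};  % enough credit: keep the session alive
charge(_Balance, _Cost) ->
    disconnect.                  % out of money: tell the gateway to drop the call
```

The real system would of course track the balance in persistent storage and handle concurrent events per account; this only shows the shape of the decision.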

Your question

Finally, let's talk about your question in this context. We want to upgrade a PDF node, and the node is running on Docker. We create a new Docker instance with the new version of the software, but we can't shut down the old version (there are hundreds of thousands of customers in the middle of their calls; we can't disconnect them). So we need to move the customers somehow from the old PDF to the new one: we tell the gateway node to create any new connections to the updated node instead of the old PDF. Customers can be chatty, and some of them may have long-running data connections (downloading a Windows 10 ISO), so the whole operation takes 2-3 days to complete. That's how long it can take to move from one version of the software to another in the case of a critical bug. And there may be dozens of servers like this one, each handling hundreds of thousands of customers.

But what if we used the Erlang release handler instead? We create the relup file with the new version of the software, test it properly, and deploy it to the PDF nodes. Each node is upgraded in place: the internal state of the application is converted and the node runs the new version of the software. But most importantly, the TCP connections to the gateway server have not been dropped. So customers happily continue their calls, or keep downloading the latest Windows ISO, while we are upgrading the system. All is done in 10 seconds rather than 2-3 days.
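For context, a release upgrade like this is driven by an .appup file, and stateful processes convert their state in a code_change/3 callback while they keep running. A minimal sketch, assuming a hypothetical gen_server pdf_session whose state tuple gains a new field between versions 1.0.0 and 1.1.0 (all names and versions here are illustrative):

```erlang
%% pdf_session.appup -- illustrative upgrade/downgrade instructions.
%% {advanced, Extra} makes the release handler suspend the process,
%% call code_change/3, and resume it without dropping connections.
{"1.1.0",
 [{"1.0.0", [{update, pdf_session, {advanced, []}}]}],   % upgrade from 1.0.0
 [{"1.0.0", [{update, pdf_session, {advanced, []}}]}]}.  % downgrade to 1.0.0

%% In the pdf_session gen_server module: convert the old state in place.
code_change(_OldVsn, {state, Sessions}, _Extra) ->
    %% assume v1.1.0 adds a counter of charged events, initialised to 0
    {ok, {state, Sessions, 0}}.
```

This is why the TCP connections survive: the process itself is never killed, only briefly suspended while its state is rewritten.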

The answer

This is an example of a specific system with specific requirements. Docker and Erlang's Release Handling are orthogonal technologies. You can use either or both, it all boils down to the following:

  • Requirements
  • Cost

Will you have enough resources to test both approaches predictably and enough patience to teach your Ops team so that they can deploy the system using either method? What if the testing facility cost millions of pounds (because of the required hardware) and can use only one of those two methods at a time (because the test cycle takes days)?

The pragmatic approach might be to deploy the nodes initially using Docker and then upgrade them with the Erlang release handler (if you need to use Docker in the first place). Or, if your system doesn't need to be available during the upgrade (unlike the example PDF system), you might just always deploy new versions with Docker and forget about release handling. Or, if you need quick and reliable on-the-fly updates, you may stick with the release handler and forget about Docker, using it only for the initial deployment. I hope that helps.
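The hybrid approach could look something like the following. This is a hedged sketch, assuming a release built with relx (whose extended start script provides an `upgrade` command); the image, container, path and release names are all hypothetical:

```shell
# 1. Initial deployment: run the v1.0.0 release inside Docker
docker run -d --name pdf_node my_pdf:1.0.0

# 2. Later, hot-upgrade in place instead of replacing the container:
#    copy the upgrade tarball into the running container...
docker cp pdf-1.1.0.tar.gz pdf_node:/opt/pdf/releases/

#    ...and install it; the node keeps running, connections stay up
docker exec pdf_node /opt/pdf/bin/pdf upgrade "1.1.0"
```

Note that this only works if the container's filesystem layout and the release tooling are set up for it; an image rebuilt from scratch for every version (the usual Docker workflow) throws that ability away.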