I am currently working on a live media server, which will allow general consumers to send live video to us. In our current environment we've seen broadcasts that last for days, so the idea of being able to fix a bug (or add a feature) without disconnecting users is extremely compelling.

However, as I was writing code I realized that hot code swapping doesn't make any sense unless I write every process so that all state is always held inside a gen_server, and all external modules that the gen_server calls are as simple as possible.

Let's take the following example:

-module(server_template).
-behaviour(gen_server).

-export([start/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

start() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) -> {ok, {module1:new(), module2:new()}}.

handle_call(_Message, _From, State) -> {reply, ok, State}.

handle_cast(any_message, {State1, State2}) ->
    NewState1 = module1:do_something(State1),
    NewState2 = module2:do_something(State2),
    {noreply, {NewState1, NewState2}}.

handle_info(_Message, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.

%% Ask each dependent module to migrate the state it last handed back.
code_change(_OldVersion, {State1, State2}, _Extra) ->
    NewState1 = module1:code_change(State1),
    NewState2 = module2:code_change(State2),
    {ok, {NewState1, NewState2}}.

According to what I could find, when a new version of code is loaded into the running runtime without using OTP, a process moves to the current code version by making a fully qualified (external) call into its own module, e.g. my_module:loop(State).
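A minimal sketch of that non-OTP pattern, assuming a hypothetical my_module whose state is just a counter (the module name comes from the text above; the message names are made up for illustration):

```erlang
-module(my_module).
-export([start/0, loop/1]).

%% Start a plain (non-OTP) process that holds a counter as its state.
start() ->
    spawn(fun() -> loop(0) end).

loop(State) ->
    receive
        upgrade ->
            %% Fully qualified call: continues in the newest loaded
            %% version of this module.
            ?MODULE:loop(State);
        {add, N} ->
            %% Local call: stays in the version this process already runs.
            loop(State + N);
        {get, From} ->
            From ! {state, State},
            loop(State)
    end.
```

loop/1 has to be exported for the fully qualified call to work. Nothing here migrates State, so the new version of loop/1 must still accept whatever the old version passes in.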

What I also see is that when a hot swap is performed the code_change/3 function is called and upgrades the state, so I can use that to make sure each of my dependent modules migrates the last state they gave me into state for the current code version. It does this because the supervisor knows about the running process, which allows the process to be suspended so it can call the code change function. All good.

However, if calling an external module always calls the current version of that module, then this would seem to break if a hot swap is done mid-function. For example, say my gen_server is currently handling the any_message cast and is somewhere between running module1:do_something() and module2:do_something().

If I am understanding things correctly, module2:do_something() would now call the newly current version of the do_something function, which could potentially mean I'm passing in unmigrated data into the new version of module2:do_something(). This would easily cause issues if it's a record that has changed, an array with an unexpected number of elements, or even if a map is missing a value that the code expects.

Am I misunderstanding how this situation works? If this is right, it seems to indicate that I must track some kind of version information for any data structure that may cross module boundaries, and every public function must check that version number and perform an on-demand migration if necessary.

That seems to be an extremely tall order that seems crazily error prone, so I am wondering if I am missing something.

You got it right. OTP makes hot code upgrade more controlled: it suspends execution of OTP-compliant code, loads the new version, calls code_change/3, and then resumes work. The other thing is that if something crashes, OTP allows you to restart it very quickly, so if your module2:do_something/1 crashes on a new data format, it can recover. Those things go hand in hand and are still an order of magnitude simpler and more robust than in any other runtime environment. (Hynek -Pichi- Vychodil)

Actually, the suspension of the OTP process isn't very clear to me. I'm assuming it suspends the process only after any in-progress handle_call/handle_cast calls have completed, otherwise it wouldn't be able to use the migrated state, correct? (KallDrexx)

This is explained here (in the update subsection): erlang.org/doc/design_principles/release_handling.html#id78465 The release handler uses sys:suspend/1,2, sys:change_code/4,5, and sys:resume/1,2 to suspend, upgrade, and then resume the process. (Greg)

That doesn't really explain how suspension and resumption are used in conjunction with the gen_server's code_change callback, nor how the pipeline progresses if the release handler suspends the process mid call/cast. It seems pretty unreasonable to expect that sys:change_code will trigger a my_gen_server:code_change, which would return updated state that would correctly get applied to a handle_cast call in progress during the suspension. That seems too magical (especially with the concept of immutability) to be clean. (KallDrexx)

2 Answers

8
votes

Yes, you are exactly right. No one said hot code swapping is easy. I worked for a telecommunication company where all code upgrades were performed on a live system (so that users aren't disconnected in the middle of their calls). Doing it right means carefully considering all those scenarios that you mentioned and preparing the code for every failure, then testing, then fixing issues, testing, and so on. To test it properly you would need a system running the old version under load (e.g. in a testing environment), then deploying the new code and checking for any crashes.

In the particular example from your question, the simplest way of dealing with this issue is to write two versions of module2:do_something/1, one accepting the old state and one accepting the new state, and to have the old-state version deal with it accordingly, e.g. by converting it to the new state.
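The two-version approach could look roughly like this. The state formats are assumptions made up for illustration: the old version of module2 is taken to have kept a bare integer counter, and the new one a map with a count and a label.

```erlang
-module(module2).
-export([new/0, do_something/1, code_change/1]).

%% Hypothetical formats: the old state was a bare integer counter,
%% the new state is a map with a counter and a label.
new() ->
    #{count => 0, label => undefined}.

%% Clause for the new format.
do_something(#{count := Count} = State) ->
    State#{count := Count + 1};
%% Clause for the old format: migrate first, then take the normal path.
do_something(OldState) when is_integer(OldState) ->
    do_something(code_change(OldState)).

%% Convert old state to new state; new state passes through unchanged.
code_change(Count) when is_integer(Count) ->
    #{count => Count, label => undefined};
code_change(State) when is_map(State) ->
    State.
```

With this in place, a caller that still holds old state, for example one caught between module1:do_something/1 and module2:do_something/1 during the swap, gets the new format back instead of a crash.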

For this to work you will also need to ensure that the new version of module2 is deployed before any module has a chance to call it with the new state:

  1. If the application containing module2 is a dependency of the other application, release_handler will upgrade that module first.

  2. Otherwise, you may need to split the deployment into two parts: first upgrade the common functions so that they can handle the new state, then deploy the new versions of the gen_servers and other modules that make calls to module2.

  3. If you are not using the release handler you could manually specify in which order the modules are loaded.
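For the manual case, a minimal sketch of loading modules in a chosen order might look like the following. The manual_reload module and its function name are hypothetical; code:soft_purge/1 and code:load_file/1 are the standard code server calls.

```erlang
-module(manual_reload).
-export([reload_in_order/1]).

%% Reload modules one by one, in the order given.
%% Stops at the first failure, so later modules are not reloaded
%% if an earlier one could not be.
reload_in_order([]) ->
    ok;
reload_in_order([Module | Rest]) ->
    %% soft_purge refuses to remove old code that a process still runs.
    case code:soft_purge(Module) of
        true ->
            case code:load_file(Module) of
                {module, Module} -> reload_in_order(Rest);
                {error, Reason} -> {error, {Module, Reason}}
            end;
        false ->
            {error, {Module, old_code_in_use}}
    end.
```

Calling manual_reload:reload_in_order([module2, server_template]) would load module2 before the gen_server that calls it. Note that loading a module this way does not invoke code_change/3, so any state migration still has to be handled separately.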

This is also the reason why in Erlang it's advised to avoid circular dependencies in function calls between modules, e.g. when modA calls a function in modB which calls another function in modA.

For upgrades performed with the help of the release handler, you can verify the order in which release_handler will upgrade modules on the old system by reading the relup file, which is generated from the old and new release (for example with systools:make_relup). It's a text file containing all instructions for the upgrade, e.g.: remove (to remove modules), load_object_code (read the new module's object code from disk), load, purge, etc.
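The relup itself is derived from per-application .appup files, which is where module dependencies are declared. A sketch of what such a file might contain for the modules in the question, with the file name and version numbers being hypothetical:

```erlang
%% server_app.appup (hypothetical application name and versions)
{"2.0.0",
 %% Instructions for upgrading from 1.0.0
 [{"1.0.0", [{load_module, module2},
             %% {advanced, Extra} requests a code_change/3 call;
             %% [module2] declares that server_template depends on module2.
             {update, server_template, {advanced, []}, [module2]}]}],
 %% Instructions for downgrading to 1.0.0
 [{"1.0.0", [{update, server_template, {advanced, []}, [module2]},
             {load_module, module2}]}]}.
```

The exact low-level instruction order is still best checked in the generated relup rather than assumed from the appup.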

Please note that there is no strict requirement that all applications follow OTP principles for hot code swapping to work. However, using gen_servers and a proper supervisor stack makes this task much easier to handle for both the developer and the release handler.

If you are not using an OTP release, you can't upgrade using the release handler, but you can still forcefully reload modules on your system and upgrade them to the new version. This works fine as long as you don't need to add or remove Erlang applications, because for that the release definition would need to change, and that can't be done on a live system without support from the release handler.

3
votes

The release handler calls sys:suspend, which sends a message to the gen_server. The server keeps processing requests until it handles the suspend message, so a callback that is already running always finishes first; at that point it basically just sits and waits. The new module version is then loaded into the system, and sys:change_code is called, which tells the server to call its code_change callback to do its upgrade, after which the server again sits and waits. When the release handler calls sys:resume, it sends a message to the server that tells it to get back to work and start processing incoming messages again.
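The same sequence can be driven by hand for a single process. This is a minimal sketch, not what the release handler literally executes: the manual_upgrade module is hypothetical, and it assumes the new object code for Module is already on the code path.

```erlang
-module(manual_upgrade).
-export([upgrade/4]).

%% Name:   pid or registered name of an OTP-compliant process
%% Module: callback module whose new object code is on the code path
%% Returns the result of sys:change_code/4 (ok or {error, Reason}).
upgrade(Name, Module, OldVsn, Extra) ->
    %% Returns once the process has handled the suspend request,
    %% i.e. after any callback that was running has finished.
    ok = sys:suspend(Name),
    try
        code:purge(Module),
        {module, Module} = code:load_file(Module),
        %% Makes the process call Module:code_change(OldVsn, State, Extra).
        sys:change_code(Name, Module, OldVsn, Extra)
    after
        %% Always resume, even if loading or the state change failed.
        ok = sys:resume(Name)
    end.
```

For the server in the question, which registers itself under its module name, a call might look like manual_upgrade:upgrade(server_template, server_template, "1.0", []), where "1.0" is a placeholder version.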

The release handler does this at the same time for all servers that depend on a module: first all are suspended, then the new module is loaded, then all are told to upgrade themselves, and finally all are told to resume work.