I am currently working on a live media server that will let general consumers send live video to us. In our current environment we have seen broadcasts that last for days, so being able to fix a bug (or add a feature) without disconnecting users is extremely compelling.
However, as I was writing code I realized that hot code swapping doesn't make any sense unless I write every process so that all state is always kept inside a gen_server, and every external module that the gen_server calls is as simple as possible.
Let's take the following example:
-module(server_template).
-behaviour(gen_server).

-export([start/0, stop/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).

start() -> gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

stop() -> gen_server:stop(?MODULE).

%% The server state is a pair of opaque states owned by module1 and module2.
init([]) -> {ok, {module1:new(), module2:new()}}.

handle_call(_Message, _From, State) -> {reply, ok, State}.

handle_cast(any_message, {State1, State2}) ->
    NewState1 = module1:do_something(State1),
    NewState2 = module2:do_something(State2),
    {noreply, {NewState1, NewState2}}.

handle_info(_Message, Server) -> {noreply, Server}.

terminate(_Reason, _Server) -> ok.

%% Let each dependent module migrate the state it handed out earlier.
code_change(_OldVersion, {State1, State2}, _Extra) ->
    NewState1 = module1:code_change(State1),
    NewState2 = module2:code_change(State2),
    {ok, {NewState1, NewState2}}.
According to what I could find, when a new version of code is loaded into the running runtime without using OTP, a process moves to the current code version by calling into its module with a fully qualified (external) function call, such as my_module:loop(state).
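A minimal sketch of that non-OTP pattern, assuming a hypothetical my_module whose state is just an integer counter (the message names get, increment, upgrade and stop are made up for illustration):

```erlang
-module(my_module).
-export([start/0, loop/1]).

%% Spawn the loop with an initial counter of 0.
%% loop/1 has to be exported so it can be reached by a fully qualified call.
start() ->
    spawn(?MODULE, loop, [0]).

loop(State) ->
    receive
        {get, From} ->
            From ! {state, State},
            %% Local call: stays in the version of the code this process is running.
            loop(State);
        increment ->
            loop(State + 1);
        upgrade ->
            %% Fully qualified call: continues in the newest loaded version of my_module.
            my_module:loop(State);
        stop ->
            ok
    end.
```

After a new version has been loaded, sending upgrade to the process makes its next iteration run the new loop/1. Nothing here migrates State, so the new loop/1 has to accept whatever the old one was holding.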
What I also see is that when a hot swap is performed, the code_change/3 function is called and upgrades the state, so I can use that to make sure each of my dependent modules migrates the last state they gave me into state for the current code version. It does this because the supervisor knows about the running process, which allows the process to be suspended so it can call the code change function. All good.
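To make the suspend, code_change, resume sequence concrete, here is a minimal sketch that drives it by hand with the sys module. The module name upgrade_demo, the helper manual_upgrade/0 and the v1/v2 state shapes are assumptions for illustration only; a real release upgrade would also load the new module version while the process is suspended.

```erlang
-module(upgrade_demo).
-behaviour(gen_server).

-export([start_link/0, get_state/0, manual_upgrade/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get_state() ->
    gen_server:call(?MODULE, get_state).

%% Hypothetical helper: suspend the process, ask it to run code_change/3,
%% then let it continue with the migrated state.
manual_upgrade() ->
    ok = sys:suspend(?MODULE),
    %% A real upgrade would load the new version of the module at this point.
    ok = sys:change_code(?MODULE, ?MODULE, undefined, []),
    ok = sys:resume(?MODULE).

%% Assumed old state shape: {v1, Count}.
init([]) -> {ok, {v1, 0}}.

handle_call(get_state, _From, State) -> {reply, State, State}.

handle_cast(_Msg, State) -> {noreply, State}.

%% Migrate the assumed v1 shape to an assumed v2 shape with an extra field.
code_change(_OldVsn, {v1, Count}, _Extra) ->
    {ok, {v2, Count, []}};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

Calling upgrade_demo:manual_upgrade() on a running server changes the value returned by upgrade_demo:get_state() from {v1, 0} to {v2, 0, []}.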
However, if calling an external module always calls the current version of that module, then this would seem to break if a hot swap is done mid-function. For example, say my gen_server is currently in the process of handling the any_message cast, somewhere between running module1:do_something() and module2:do_something().
If I am understanding things correctly, module2:do_something() would now call the newly current version of the do_something function, which could mean I'm passing unmigrated data into the new version of module2:do_something(). This would easily cause issues if it's a record that has changed, an array with an unexpected number of elements, or even a map that is missing a value the code expects.
Am I misunderstanding how this situation works? If this is right, it seems to indicate that I must track some type of version details for any data structure that may cross module boundaries, and every public function must check that version number and perform an on-demand migration if necessary.
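A minimal sketch of what such version tracking might look like, assuming a hypothetical module2_sketch whose version 1 state was a bare integer counter and whose version 2 state is a map (both layouts are made up for illustration):

```erlang
-module(module2_sketch).
-export([new/0, do_something/1]).

%% Assumed state layout: {module2_state, Version, Data}.
-define(VERSION, 2).

new() ->
    {module2_state, ?VERSION, #{count => 0, label => <<>>}}.

%% Every public function migrates its argument before touching it.
do_something(State0) ->
    {module2_state, ?VERSION, Data} = migrate(State0),
    Count = maps:get(count, Data),
    {module2_state, ?VERSION, Data#{count := Count + 1}}.

%% Version 1 (assumed) held only an integer; wrap it in the version 2 map.
migrate({module2_state, 1, Count}) when is_integer(Count) ->
    migrate({module2_state, 2, #{count => Count, label => <<>>}});
%% Already current: nothing to do.
migrate({module2_state, ?VERSION, _} = State) ->
    State.
```

With this layout, do_something/1 accepts both a version 1 and a version 2 state and always returns a version 2 state, but each public function has to repeat the migrate/1 call.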
That is an extremely tall order and seems crazily error-prone, so I am wondering if I am missing something.
... code_change/3 and then continues its work. It is a much more controlled hot code upgrade. The other thing is, if something crashes, OTP allows you to restart it very quickly, so if your module2:do_something/1 crashes with the new data format it can recover. Those things go hand in hand and are still an order of magnitude simpler and more robust than in any other runtime environment. – Hynek -Pichi- Vychodil

... handle_call/handle_cast calls are in progress, otherwise it wouldn't be able to use the migrated state, correct? – KallDrexx

... sys:suspend/1,2, sys:change_code/4,5, and sys:resume/1,2 to suspend, upgrade and then resume the process. – Greg

... gen_server's code_change callback, nor how the pipeline progresses if the release handler suspends the process mid call/cast. It seems pretty unreasonable to expect that sys:change_code will trigger a my_gen_server:code_change, which would return updated state that would correctly get applied to a handle_cast call in progress during the suspension. That seems too magical (especially with the concept of immutability) to be clean. – KallDrexx