Shared Lua state between pthread seg-fault if not executing coroutine

Question

First of all, I know my question looks familiar, but I am not asking why a seg-fault occurs when sharing a Lua state between different pthreads. I am asking why it doesn't seg-fault in the specific case described below. I tried to organize it as well as I could, but I realize it is very long. Sorry about that.

A bit of background: I am writing a program that uses the Lua interpreter as a base for the user to execute instructions, and the ROOT libraries (https://root.cern.ch/) to display graphs, histograms, etc. All of this works just fine, but then I tried to implement a way for the user to start a background task while keeping the ability to enter commands at the Lua prompt, so they can do something else entirely while the task finishes, or request to stop it, for instance.

My first attempt was the following. First, on the Lua side, I load some helper functions and initialize global variables:

-- Lua script
RootTasks = {}
NextTaskToStart = nil

function SetupNewTask(taskname, fn, ...)
  local task = function(...)
      local rets = table.pack(fn(...))

      RootTasks[taskname].status = "done"

      return table.unpack(rets)
    end

  RootTasks[taskname] = {
    task = SetupNewTask_C(task, ...),
    status = "waiting",
  }

  NextTaskToStart = taskname
end

Then on the C side

// inside the C++ script
int SetupNewTask_C ( lua_State* L )
{
    // just a function to check if the argument is valid
    if ( !CheckLuaArgs ( L, 1, true, "SetupNewTask_C", LUA_TFUNCTION ) ) return 0;

    int nvals = lua_gettop ( L );

    lua_newtable ( L );

    for ( int i = 0; i < nvals; i++ )
    {
        lua_pushvalue ( L, 1 );
        lua_remove ( L, 1 );
        lua_seti ( L, -2, i+1 );
    }

    return 1;
}

Basically, the user provides the function to execute followed by the parameters to pass, and SetupNewTask_C just pushes a table with the function as the first field and the arguments as subsequent fields. This table is pushed on top of the stack; I retrieve it and store it in a global variable. The next step is on the Lua side

-- Lua script
function StartNewTask(taskname, fn, ...)
  SetupNewTask(taskname, fn, ...)
  StartNewTask_C()
  RootTasks[taskname].status = "running"
end

and on the C side

// In the C++ script
// lua, called below, is a pointer to the lua_State 
// created when starting the Lua interpreter

void* NewTaskFn ( void* arg )
{
    // helper function to get global fields from 
    // strings like "something.field.subfield"
    // Retrieve the name of the task to be started (has been pushed as 
    // a global variable by previous call to SetupNewTask_C)
    TryGetGlobalField ( lua, "NextTaskToStart" );

    if ( lua_type ( lua, -1 ) != LUA_TSTRING )
    {
        cerr << "Next task to schedule is undetermined..." << endl;
        return nullptr;
    }

    string nextTask = lua_tostring ( lua, -1 );
    lua_pop ( lua, 1 );

    // Now we get the actual table with the function to execute 
    // and the arguments
    TryGetGlobalField ( lua, ( string ) ( "RootTasks."+nextTask ) );

    if ( lua_type ( lua, -1 ) != LUA_TTABLE )
    {
        cerr << "This task does not exists or has an invalid format..." << endl;
        return nullptr;
    }

    // The field "task" from the previous table contains the 
    // function and arguments
    lua_getfield ( lua, -1, "task" );

    if ( lua_type ( lua, -1 ) != LUA_TTABLE )
    {
        cerr << "This task has an invalid format..." << endl;
        return nullptr;
    }

    lua_remove ( lua, -2 );

    int taskStackPos = lua_gettop ( lua );

    // The first element of the table we retrieved is the function so the
    // number of arguments for that function is the table length - 1
    int nargs = lua_rawlen ( lua, -1 ) - 1;

    // That will be the function
    lua_geti ( lua, taskStackPos, 1 );

    // And the arguments...
    for ( int i = 0; i < nargs; i++ )
    {
        lua_geti ( lua, taskStackPos, i+2 );
    }

    lua_remove ( lua, taskStackPos );

    // I just reset the global variable NextTaskToStart as we are 
    // about to start the scheduled one.
    lua_pushnil ( lua );
    TrySetGlobalField ( lua, "NextTaskToStart" );

    // Let's go!
    lua_pcall ( lua, nargs, LUA_MULTRET, 0 );

    // A pthread start routine must return a void*
    return nullptr;
}

int StartNewTask_C ( lua_State* L )
{
    pthread_t newTask;

    pthread_create ( &newTask, nullptr, NewTaskFn, nullptr );

    return 0;
}

So for instance a call in the Lua interpreter to

> StartNewTask("PeriodicPrint", function(str) for i=1,10 do print(str);
>> sleep(1); end end, "Hello")

This will print "Hello" every second for the next 10 seconds. It will then return from execution and everything is wonderful. Now, if I ever hit the ENTER key while that task is running, the program dies in horrible seg-fault sufferings (I don't copy the output here because the error log is different each time it seg-faults, and sometimes there is no error message at all). So I read a bit online about what could be the matter, and I found several mentions that a lua_State is not thread-safe. I don't really understand why just hitting ENTER makes it flip out, but that's not really the point here.

I discovered by accident that this approach can work without any seg-fault with a tiny modification: if a coroutine is executed instead of running the function directly, everything I wrote above works just fine.

Replace the previous Lua-side function SetupNewTask with

function SetupNewTask(taskname, fn, ...)
  local task = coroutine.create( function(...)
      local rets = table.pack(fn(...))

      RootTasks[taskname].status = "done"

      return table.unpack(rets)
    end)

  local taskfn = function(...)
    coroutine.resume(task, ...)
  end

  RootTasks[taskname] = {
    task = SetupNewTask_C(taskfn, ...),
    routine = task,
    status = "waiting",
  }

  NextTaskToStart = taskname
end

With this version I can execute several tasks at once for extended periods of time without getting any seg-faults.

So we finally come to my question: why does using a coroutine work? What is the fundamental difference in this case? I just call coroutine.resume and I do not yield (or do anything else, for that matter). I then just wait for the coroutine to be done and that's it. Are coroutines doing something I do not suspect?

nobody nobody · Accepted Answer · 2017-10-06T06:52:59

That it seems as if nothing broke doesn't mean that it actually works, so…

What's in a `lua_State`?

(This is what a coroutine is.)

A lua_State stores this coroutine's state – most importantly its stack, CallInfo list, a pointer to the global_State, and a bunch of other stuff.
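To make that layout concrete, here is a toy model of it in C++. It is only a sketch: `ToyState`, `ToyGlobal`, `allocate` and `new_coroutine` are names invented for this illustration, not Lua's real structs or API (the real definitions are in lstate.h and contain far more). The point is just that each coroutine gets its own stack while all of them point at one shared global object.

```cpp
#include <memory>
#include <vector>

// Hypothetical, heavily simplified stand-in for the shared global state.
struct ToyGlobal {
    long gc_debt = 0;           // plays the role of the GC debt counter
    int  strings_interned = 0;  // plays the role of the string cache
};

// Hypothetical stand-in for one coroutine's state.
struct ToyState {
    std::vector<int> stack;             // private to each coroutine
    std::shared_ptr<ToyGlobal> global;  // shared by all coroutines

    // Every "allocation" touches the shared part, whichever coroutine asks.
    void allocate(long bytes) { global->gc_debt += bytes; }
};

// Analogue of creating a coroutine: fresh stack, same global state.
ToyState new_coroutine(const ToyState& parent) {
    return ToyState{{}, parent.global};
}
```

If you create a main `ToyState`, derive a coroutine from it with `new_coroutine`, and push onto the main stack, the coroutine's stack stays empty. Calling `allocate` on either of them changes the same `gc_debt`.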

If you hit return in the REPL of the standard Lua interpreter, the interpreter tries to run the code you typed. (An empty line is also a program.) This involves putting it on the Lua stack, calling some functions, etc. etc. If you have code running in a different OS thread that is also using the same Lua stack/state… well, I think it's clear why this breaks, right? (One part of the problem is caching of stuff that "doesn't"/shouldn't change (but changes because another thread is also messing with it). Both threads are pushing/popping stuff on the same stack and step on each other's feet. If you want to dig through the code, luaV_execute may be a good starting point.)
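As a rough illustration of "stepping on each other's feet", the sketch below replays one possible interleaving on a single OS thread, so that it is deterministic. `ToyStack` and `replay_bad_interleaving` are made-up names and the stack is a plain vector of strings, not Lua's stack. A real race is worse, because the individual pushes and pops are not atomic either.

```cpp
#include <string>
#include <vector>

// Toy stand-in for one Lua stack; NOT Lua's real data structure.
struct ToyStack {
    std::vector<std::string> slots;

    void push(const std::string& v) { slots.push_back(v); }

    std::string pop() {
        std::string v = slots.back();
        slots.pop_back();
        return v;
    }
};

// Replays, on one OS thread, an interleaving that two unsynchronized
// threads sharing the same stack could produce.
std::string replay_bad_interleaving() {
    ToyStack shared;
    shared.push("repl: compiled chunk");  // REPL thread: about to call its chunk
    shared.push("task: argument");        // task thread sneaks in a push
    return shared.pop();                  // REPL thread pops, expecting its chunk
}
```

Here the REPL side gets back the task's argument instead of the chunk it just pushed, which is the kind of mix-up that ends in a crash once the value is used as something it is not.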

So now you're using two different coroutines, and all the obvious sources of problems are gone. Now it works, right…? Nope, because coroutines share state,

The `global_State`!

This is where the "registry", string cache, and all the things related to garbage collection live. And while you got rid of the main "high-frequency" source of errors (stack handling), many many other "low-frequency" sources remain. A brief (non-exhaustive!) list of some of them:

- You can potentially trigger a garbage collection step by any allocation, which will then run the GC for a bit, which uses its shared structures. And while allocations usually don't trigger the GC, the GCdebt counter that controls this is part of the global state, so once it crosses the threshold, allocations on multiple threads at the same time have a good chance to start the GC on several threads at once. (If that happens, it'll almost certainly explode violently.) Any allocation means, among others:
  - creating tables, coroutines, userdata, …
  - concatenating strings, reading from files, tostring(), …
  - calling functions(!) (if that requires growing the stack or allocating a new CallInfo slot)
  - etc.
- (Re-)Setting a thing's metatable may modify GC structures. (If the metatable has __gc or __mode, it gets added to a list.)
- Adding new fields to tables, which may trigger a resize. If you're also accessing it from another thread during the resize (even just reading existing fields), well… *boom*. (Or not boom, because while the data may have moved to a different area, the memory where it was before is probably still accessible. So it might "work" or only lead to silent corruption.)
- Even if you stopped the GC, creating new strings is unsafe because it may modify the string cache.

And then probably lots of other things…
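The GCdebt item in the list above is a classic check-then-act race. The sketch below replays one bad interleaving sequentially so that the outcome is deterministic. Everything in it (`ToyGlobal`, `gc_due`, `gc_step`, the numbers) is a simplified assumption for illustration, not Lua's actual collector code.

```cpp
// Hypothetical, simplified model of the shared GC bookkeeping.
struct ToyGlobal {
    long gc_debt = 1;        // > 0 means "a collection step is due"
    int  steps_started = 0;  // how many times the collector was entered
};

// Step 1 of an allocation: decide whether the collector should run.
bool gc_due(const ToyGlobal& g) { return g.gc_debt > 0; }

// Step 2: run a collection step and pay off the debt.
void gc_step(ToyGlobal& g) {
    ++g.steps_started;
    g.gc_debt = -1024;
}

// Replays one bad interleaving of two allocating threads, A and B.
int replay_double_collection() {
    ToyGlobal g;
    bool a_sees_due = gc_due(g);  // thread A checks
    bool b_sees_due = gc_due(g);  // thread B checks before A has acted
    if (a_sees_due) gc_step(g);   // A enters the collector
    if (b_sees_due) gc_step(g);   // B enters it too, on the same structures
    return g.steps_started;
}
```

In this replay the collector is entered twice for a single due step. With real threads the two entries would overlap in time and work on the same lists at once, which is where the violent explosion comes from.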

Making it Fail

For fun, you can re-build Lua and #define both HARDSTACKTESTS and HARDMEMTESTS (e.g. at the very top of luaconf.h). This will enable some code that will reallocate the stack and run a full GC cycle in many places. (For me, it does 260 stack reallocations and 235 collections just until it brings up the prompt. Just hitting return (running an empty program) does 13 stack reallocations and 6 collections.) Running your program that seems to work with that enabled will probably make it crash… or maybe not?
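The change itself is tiny. As a sketch, assuming a Lua 5.3 source tree (these are internal test switches, so other versions may not have them or may behave differently), the top of luaconf.h would start like this:

```cpp
/* luaconf.h, at the very top, before anything else in the file */
#define HARDSTACKTESTS  /* reallocate the stack in many places */
#define HARDMEMTESTS    /* run a full GC cycle in many places */
```

After that, rebuild Lua and relink your program against the rebuilt library so that the extra checks are actually compiled in.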

Why it might still "work"

So for instance a call in the Lua interpreter to
StartNewTask("PeriodicPrint", function(str)
  for i=1,10 do print(str); sleep(1); end
end, "Hello")
Will produce for the next 10 seconds a print of "Hello" every second.

In this particular example, there's not much happening. All the functions and strings are allocated before you start the thread. Without HARDSTACKTESTS, you might be lucky and the stack is already big enough. Even if the stack needs to grow, the allocation (& collection cycle because HARDMEMTESTS) may have the right timing so that it doesn't break horribly. But the more "real work" your test program does, the more likely it is to crash. (One good way to do that is to create lots of tables and stuff so the GC needs more time for the full cycle and the time window for interesting race conditions gets bigger. Or maybe just repeatedly run a dummy function really fast, like `for i = 1, 1e9 do (function() return i end)() end`, on 2+ threads and hope for the best… err, worst.)
