3
votes

My Qt/C++ app uses worker threads (QThread) to improve performance for users with multicore processors. Each worker's job is to manipulate some data. Each worker minds its own business and does not need to communicate with any other workers. They also don't perform any I/O operations. Perfect use case!

Multithreading this workload has improved performance many times over.

Running on a Ryzen 9 3900X (12 cores)


However, now each worker is also tasked with passing its data through a Lua script. So, each worker gets its own Lua script instance (an object containing its own lua_State). The data is passed between the native code and the Lua script through userdata in the form of pointers to these things I call "SharedObjects." All I have to do is derive from this SharedObject class and boom, Lua can talk to it!

All my Lua workload does is some basic logic and calling native functions to allocate new things that derive from SharedObject and return them. Basically, it creates a lot of SharedObjects and connects them to each other in specific ways.


When the script has a light workload the multithreaded performance stays great.

But once the script has a heavy workload the performance drops as the thread count rises above 4.

Here are the results of the tests I ran:

[Image: chart of the test results]

I don't understand why a heavy workload causes performance to get worse as the thread count goes up. I would expect it to reach a maximum and flatline.


EDIT: I created a minimal reproducible example project that perfectly simulates the problem. I compiled with MSVC2010 (as per my real application). https://github.com/MRG95/LuaThreads

Explanation of GitHub project files:

  • main.cpp: Entry point. Creates the workers and simulates a workload. A timer keeps track of how long it takes to complete the work.
  • Lua/lua_script.h: The interface between the Lua script and native code. Native methods and properties are accessed through Qt's QMetaObject implementation. The function void bindObject() sets up the connection.
  • worker.h: Defines the Worker class, which gets moved to its QThread via moveToThread. The script function call happens in void doWork().
  • tags.h/tags.cpp: Example data types that get processed in the script.

In the build folder is a file testScript.lua that is the sample workload itself. It's just a simple loop running some of the methods found in the tags.h classes.
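
For readers who don't want to open the repository, the general shape of such a timing harness can be sketched in standard C++. This is not the code from the LuaThreads project: it uses std::thread instead of QThread, and timeWorkers and independentWork are hypothetical names chosen for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Runs `threadCount` threads that each execute `work` once and returns the
// elapsed wall-clock time in milliseconds. Hypothetical helper, not taken
// from the LuaThreads repository.
double timeWorkers(unsigned threadCount, const std::function<void()>& work)
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (unsigned i = 0; i < threadCount; ++i)
        threads.emplace_back(work);   // each thread gets its own copy of `work`
    for (std::thread& t : threads)
        t.join();

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in for a worker's job: purely local arithmetic, no shared state.
std::uint64_t independentWork(std::uint64_t iterations)
{
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i)
        acc += i * i;
    return acc;
}
```

Calling timeWorkers with 1, 2, 4, 8 and 12 threads and the same per-thread workload gives one timing per thread count to compare. With truly independent work the time should stay roughly flat until the core count is reached; a time that climbs earlier hints at some shared resource.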

2
Show some minimal reproducible example in your question. We cannot reliably guess the software architecture of your design. - Basile Starynkevitch
There's so much though.... I'm looking more for potential causes of more threads causing worse performance vs single thread. What could even cause a slowdown in the first place? I'd do the reproducible example after I've tried some more things first since I've isolated the problem script calls pretty well. - mrg95
What is your main OS? Linux or something else? - Basile Starynkevitch
Windows 10 only - mrg95
Accept my condolences. Linux would be simpler. - Basile Starynkevitch

2 Answers

5
votes

Note the DirectConnection which means it's not queuing the calls.

This could be wrong. Read more about QThread. Maybe you should use Qt::QueuedConnection.

Let's assume that each QThread runs its own Lua interpreter and state (you should study the source code of your Lua interpreter, but it might have some GIL, or practically need one).

We cannot guess your source code, but you might want to use a Per-Thread Event Loop, keep every Lua interpreter in its own QThread, and use some fine-grained QMutex on global shared state data. So small and short Lua primitives would each use some shared QMutex.
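
A minimal sketch of that idea, written with std::thread and std::mutex rather than QThread and QMutex so it builds without Qt. SharedTotals, worker and runWorkers are hypothetical names; the point is only that each thread works on private state and holds the lock for one short update.

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Global state shared by all workers, guarded by one mutex.
struct SharedTotals
{
    std::mutex mutex;
    long long total = 0;
};

// Each worker accumulates into a private variable and takes the lock once,
// only for the brief final update.
void worker(SharedTotals& shared, int items)
{
    long long local = 0;                              // per-thread state, no lock
    for (int i = 1; i <= items; ++i)
        local += i;

    std::lock_guard<std::mutex> lock(shared.mutex);   // short critical section
    shared.total += local;
}

long long runWorkers(unsigned threadCount, int itemsPerWorker)
{
    SharedTotals shared;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < threadCount; ++i)
        threads.emplace_back(worker, std::ref(shared), itemsPerWorker);
    for (std::thread& t : threads)
        t.join();
    return shared.total;
}
```

If the lock were instead taken inside the loop, every iteration of every thread would queue on the same mutex, which is the kind of pattern that makes throughput fall as threads are added.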

Remember that Qt graphics operations are allowed only from the main thread (the one connected to the Xorg server on Linux).

But what I can't understand at all is why a heavy workload causes performance to get worse as thread count goes up???

It might be related to CPU cache and cache coherence. Don't expect magic performance scaling when the number of all active threads and processes is more than the number of cores.
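
One way to avoid oversubscribing the machine is to cap the worker count at what the hardware reports. The sketch below uses std::thread::hardware_concurrency() from the standard library (Qt offers QThread::idealThreadCount() for the same purpose); workerCount is a hypothetical helper name.

```cpp
#include <algorithm>
#include <thread>

// Clamps the requested number of workers to the number of hardware threads,
// and never returns less than 1.
unsigned workerCount(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency(); // may be 0 if unknown
    if (hw == 0)
        hw = 1;
    return std::max(1u, std::min(requested, hw));
}
```

Note that the reported figure counts logical processors, so it is an upper bound on useful CPU-bound threads, not a guarantee of linear scaling.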

This clearly indicates to me that Lua is the bottleneck

I am not sure it is correct, and without seeing your source code, I believe it could be wrong. The bottleneck is probably inside your own code (which you don't show). To be sure, study the source code of Lua.

You could use profiling tools (on Linux, gprof(1) or perf(1)). If you compile your C++ code and the source code of Lua with GCC, you may need to pass specific options (for example -pg for gprof, or -g for debug symbols).

3
votes

I was able to figure out the issue, and it was unrelated to Lua, QThread, or CPU caching. It's actually really unexpected.

The problem lies with QMetaType lookups. When comparing QMetaMethod parameter types to determine which native functions the Lua script was attempting to call, I make use of the method QMetaType::type().

When Qt looks up the type, it does so in a global lookup that causes these multithreading issues. The answer was to instead compare the type names themselves and avoid all QMetaType lookups altogether.
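
To illustrate the difference between the two approaches without depending on Qt, here is a small simulation in standard C++. GlobalTypeRegistry is a hypothetical stand-in for a process-wide registry guarded by a lock, not Qt's actual implementation, and matchesById and matchesByName are illustrative names.

```cpp
#include <cstring>
#include <map>
#include <mutex>
#include <string>

// Hypothetical process-wide type registry. Every lookup takes the same
// mutex, so concurrent callers serialize on it.
class GlobalTypeRegistry
{
public:
    // Registers a name and returns its id; returns the existing id if the
    // name is already known.
    int registerType(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = ids_.find(name);
        if (it != ids_.end())
            return it->second;
        const int id = nextId_++;
        ids_.emplace(name, id);
        return id;
    }

    // Returns 0 for an unknown name.
    int typeId(const std::string& name) const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = ids_.find(name);
        return it == ids_.end() ? 0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, int> ids_;
    int nextId_ = 1;
};

// Approach 1: resolve the parameter's type name through the shared registry,
// then compare ids. Each call goes through the global lock.
bool matchesById(const GlobalTypeRegistry& registry,
                 const char* paramTypeName, int wantedId)
{
    return registry.typeId(paramTypeName) == wantedId;
}

// Approach 2: compare the type names directly. Touches no shared state.
bool matchesByName(const char* paramTypeName, const char* wantedTypeName)
{
    return std::strcmp(paramTypeName, wantedTypeName) == 0;
}
```

Both functions answer the same question for registered types, but only the first one makes every thread pass through a single shared lock on each call.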

So all this time, it was Qt getting in my way. Their documentation says QMetaType is thread-safe, which it is, but at the cost of halting other threads until each lookup finishes. The new name-comparison method performs worse in single-threaded use cases but better in multithreaded ones. In the future, I plan to switch between the two depending on the number of running threads.