4
votes

We know that local static variable initialization is thread-safe in C++11, and modern compilers fully support this. (Is local static variable initialization thread-safe in C++11?)

What is the cost of making it thread-safe? I understand that this could very well be compiler implementation dependent.

Context: I have a multi-threaded application (10 threads) accessing a singleton object pool instance via the following function at very high rates, and I'm concerned about its performance implications.

template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    static ObjectPool<T> instance;
    return &instance;
}
3
Just a warning: when your application exits, destructors for static variables will be called, including for singleton objects that are still in use by another thread. - gnasher729
How about measuring it against another well-known technique, double-checked locking with atomics? Then you would have a benchmark and could make an educated guess about the cost. - Arunmu
@Arunmu That's a good idea, I'll try it and see. I was hoping someone could shed some light on how compilers actually implement it. - Sunanda
Pass it as an argument to the thread, so you don't have any overhead from this part. - Jarod42
@Sunanda How it's implemented is basically up to the compiler and platform, but from what I know most of them use some form of DCLP to achieve thread safety. So it's reasonably safe to assume it will be more efficient than whatever DCLP version you can come up with. - Arunmu

3 Answers

5
votes

A look at the generated assembler code helps.

Source

#include <vector>

std::vector<int> &get(){
  static std::vector<int> v;
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movzbl  guard variable for get()::v(%rip), %eax
        testb   %al, %al
        je      .L15
        movl    get()::v, %eax
        ret
.L15:
        subq    $8, %rsp
        movl    guard variable for get()::v, %edi
        call    __cxa_guard_acquire
        testl   %eax, %eax
        je      .L6
        movl    guard variable for get()::v, %edi
        movq    $0, get()::v(%rip)
        movq    $0, get()::v+8(%rip)
        movq    $0, get()::v+16(%rip)
        call    __cxa_guard_release
        movl    $__dso_handle, %edx
        movl    get()::v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        call    __cxa_atexit
.L6:
        movl    get()::v, %eax
        addq    $8, %rsp
        ret
main:
        subq    $8, %rsp
        call    get()
        movq    8(%rax), %rdx
        subq    (%rax), %rdx
        addq    $8, %rsp
        movq    %rdx, %rax
        sarq    $2, %rax
        ret

Compared to

Source

#include <vector>

static std::vector<int> v;
std::vector<int> &get(){
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movl    v, %eax
        ret
main:
        movq    v+8(%rip), %rax
        subq    v(%rip), %rax
        sarq    $2, %rax
        ret
        movl    $__dso_handle, %edx
        movl    v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        movq    $0, v(%rip)
        movq    $0, v+8(%rip)
        movq    $0, v+16(%rip)
        jmp     __cxa_atexit

I'm not great with assembler, but I can see that in the first version v has a guard (lock) around it and get is not inlined, whereas in the second version get is essentially gone.
You can play around with various compilers and optimization flags, but it seems no compiler is able to inline or optimize out the guard, even though the program is obviously single-threaded.
You can add static to get, which makes gcc inline get while preserving the guard.

To know how much these locks and additional instructions cost for your compiler, flags, platform, and surrounding code, you would need to write a proper benchmark.
I would expect the locks to have some overhead and to be noticeably slower than the inlined code, though the difference becomes insignificant once you actually do work with the vector; you can never be sure without measuring.

2
votes

In my experience, this costs about as much as taking a regular mutex (critical section) on every call. If the code is called very frequently, consider using a normal global variable instead.

1
votes

This is explained extensively by Jason Turner here: https://www.youtube.com/watch?v=B3WWsKFePiM

I put together some sample code to illustrate the video. Since thread safety is the main issue, I call the method from multiple threads to see its effects.

You can think of the compiler as implementing double-checked locking for you, even though implementations are free to do whatever they want to ensure thread safety. At minimum they will add a branch to distinguish the first-time initialization, unless the optimizer performs the initialization eagerly at global scope.

https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C++11

#include <iostream>
#include <string>
#include <vector>
#include <thread>

struct Temp
{
  // Every time this method is called, the compiler has to check whether
  // `name` has been constructed yet, due to the init-at-first-use idiom.
  // This at least involves an atomic load, and possibly a lock acquisition.
  static const std::string& name() {
    static const std::string name = "name";
    return name;
  }

  // The following avoids that per-call check. The profiler showed a
  // slight performance improvement.
  const std::string& ref_name = name();
  const std::string& get_name_ref() const {
    return ref_name;
  }
};

int main(int, char**)
{
  Temp tmp;

  constexpr int num_worker = 8;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_worker; ++i) {
    threads.emplace_back([&](){
      for (int i = 0; i < 10000000; ++i) {
        // name() is almost 5s slower
        printf("%zu\n", tmp.get_name_ref().size());
      }
    });
  }

  for (int i = 0; i < num_worker; ++i) {
    threads[i].join();
  }

  return 0;
}

The name() version is about 5 s slower than get_name_ref() on my machine over the whole run.

$ time ./test > /dev/null

I also used Compiler Explorer to see what gcc generates. The following dump of gcc's intermediate representation demonstrates the double-checked locking pattern: note the atomic load and the guard acquire/release.

name ()
{
  bool retval.0;
  bool retval.1;
  bool D.25443;
  struct allocator D.25437;
  const struct string & D.29013;
  static const struct string name;

  _1 = __atomic_load_1 (&_ZGVZL4namevE4name, 2);
  retval.0 = _1 == 0;
  if (retval.0 != 0) goto <D.29003>; else goto <D.29004>;
  <D.29003>:
  _2 = __cxa_guard_acquire (&_ZGVZL4namevE4name);
  retval.1 = _2 != 0;
  if (retval.1 != 0) goto <D.29006>; else goto <D.29007>;
  <D.29006>:
  D.25443 = 0;
  try
    {
      std::allocator<char>::allocator (&D.25437);
      try
        {
          try
            {
              std::__cxx11::basic_string<char>::basic_string (&name, "name", &D.25437);
              D.25443 = 1;
              __cxa_guard_release (&_ZGVZL4namevE4name);
              __cxa_atexit (__dt_comp , &name, &__dso_handle);
            }
          finally
            {
              std::allocator<char>::~allocator (&D.25437);
            }
        }
      finally
        {
          D.25437 = {CLOBBER};
        }
    }
  catch
    {
      if (D.25443 != 0) goto <D.29008>; else goto <D.29009>;
      <D.29008>:
      goto <D.29010>;
      <D.29009>:
      __cxa_guard_abort (&_ZGVZL4namevE4name);
      <D.29010>:
    }
  goto <D.29011>;
  <D.29007>:
  <D.29011>:
  goto <D.29012>;
  <D.29004>:
  <D.29012>:
  D.29013 = &name;
  return D.29013;
}