4
votes

We know that local static variable initialization is thread-safe in C++11, and modern compilers fully support this. (Is local static variable initialization thread-safe in C++11?)

What is the cost of making it thread-safe? I understand that this could very well be compiler implementation dependent.

Context: I have a multi-threaded application (10 threads) accessing a singleton object pool instance via the following function at very high rates, and I'm concerned about its performance implications.

template <class T>
ObjectPool<T>* ObjectPool<T>::GetInst()
{
    static ObjectPool<T> instance;
    return &instance;
}
3
Just a warning: when your application exits, destructors for static variables will be called, including for singleton objects that are still in use by another thread. - gnasher729
How about measuring it against another well-known technique, double-checked locking with atomics? Then you would have a benchmark and could make an educated guess about the cost. - Arunmu
@Arunmu That's a good idea, I'll try it and see. I was hoping someone could shed some light on how compilers actually implement it. - Sunanda
Pass it as an argument to the thread, so you don't have any overhead from this part. - Jarod42
@Sunanda How it's implemented is basically up to the compiler and platform, but from what I know most of them use some form of DCLP to achieve thread safety. So it's reasonably safe to assume it will be more efficient than whatever DCLP version you can come up with. - Arunmu

3 Answers

5
votes

A look at the generated assembler code helps.

Source

#include <vector>

std::vector<int> &get(){
  static std::vector<int> v;
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movzbl  guard variable for get()::v(%rip), %eax
        testb   %al, %al
        je      .L15
        movl    get()::v, %eax
        ret
.L15:
        subq    $8, %rsp
        movl    guard variable for get()::v, %edi
        call    __cxa_guard_acquire
        testl   %eax, %eax
        je      .L6
        movl    guard variable for get()::v, %edi
        movq    $0, get()::v(%rip)
        movq    $0, get()::v+8(%rip)
        movq    $0, get()::v+16(%rip)
        call    __cxa_guard_release
        movl    $__dso_handle, %edx
        movl    get()::v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        call    __cxa_atexit
.L6:
        movl    get()::v, %eax
        addq    $8, %rsp
        ret
main:
        subq    $8, %rsp
        call    get()
        movq    8(%rax), %rdx
        subq    (%rax), %rdx
        addq    $8, %rsp
        movq    %rdx, %rax
        sarq    $2, %rax
        ret

Compared to

Source

#include <vector>

static std::vector<int> v;
std::vector<int> &get(){
  return v;
}
int main(){
  return get().size();
}

Assembler

std::vector<int, std::allocator<int> >::~vector():
        movq    (%rdi), %rdi
        testq   %rdi, %rdi
        je      .L1
        jmp     operator delete(void*)
.L1:
        rep ret
get():
        movl    v, %eax
        ret
main:
        movq    v+8(%rip), %rax
        subq    v(%rip), %rax
        sarq    $2, %rax
        ret
        movl    $__dso_handle, %edx
        movl    v, %esi
        movl    std::vector<int, std::allocator<int> >::~vector(), %edi
        movq    $0, v(%rip)
        movq    $0, v+8(%rip)
        movq    $0, v+16(%rip)
        jmp     __cxa_atexit

I'm not great with assembler, but I can see that in the first version v has a guard (lock) around it and get is not inlined, whereas in the second version get is essentially gone.
You can play around with various compilers and optimization flags, but it seems no compiler is able to inline or optimize out the guard, even though the program is obviously single-threaded.
You can add static to get, which makes gcc inline get while preserving the guard.

To know how much these locks and additional instructions cost for your compiler, flags, platform, and surrounding code, you would need to write a proper benchmark.
I would expect the locks to have some overhead and to be noticeably slower than the inlined code, though the difference becomes insignificant once you actually do work with the vector; you can never be sure without measuring.

2
votes

In my experience, this costs about as much as taking a regular mutex (critical section) on every call. If the code is called very frequently, consider using a normal global variable instead.

1
votes

This is explained extensively by Jason Turner here: https://www.youtube.com/watch?v=B3WWsKFePiM

I put together some sample code to illustrate the video. Since thread safety is the main issue, I call the method from multiple threads to see its effects.

You can think of the compiler as implementing double-checked locking for you, even though implementations are free to do whatever they want to ensure thread safety. At minimum they will add a branch to distinguish the first-time initialization, unless the optimizer performs the initialization eagerly at global scope.

https://en.wikipedia.org/wiki/Double-checked_locking#Usage_in_C++11

#include <iostream>
#include <string>
#include <vector>
#include <thread>

struct Temp
{
  // Every time this method is called, the compiler has to check whether
  // `name` has been constructed yet, due to the init-at-first-use idiom.
  // This at least involves an atomic load, and possibly a lock acquisition.
  static const std::string& name() {
    static const std::string name = "name";
    return name;
  }

  // The following avoids that per-call check. The profiler showed a
  // slight performance improvement.
  const std::string& ref_name = name();
  const std::string& get_name_ref() const {
    return ref_name;
  }
};

int main(int, char**)
{
  Temp tmp;

  constexpr int num_worker = 8;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_worker; ++i) {
    threads.emplace_back([&](){
      for (int i = 0; i < 10000000; ++i) {
        // name() is almost 5s slower
        printf("%zu\n", tmp.get_name_ref().size());
      }
    });
  }

  for (int i = 0; i < num_worker; ++i) {
    threads[i].join();
  }

  return 0;
}

The name() version is about 5 s slower than get_name_ref() on my machine over the whole run.

$ time ./test > /dev/null

I also used Compiler Explorer to see what gcc generates. The following dump of gcc's intermediate representation demonstrates the double-checked locking pattern: note the atomic load and the guard acquire/release.

name ()
{
  bool retval.0;
  bool retval.1;
  bool D.25443;
  struct allocator D.25437;
  const struct string & D.29013;
  static const struct string name;

  _1 = __atomic_load_1 (&_ZGVZL4namevE4name, 2);
  retval.0 = _1 == 0;
  if (retval.0 != 0) goto <D.29003>; else goto <D.29004>;
  <D.29003>:
  _2 = __cxa_guard_acquire (&_ZGVZL4namevE4name);
  retval.1 = _2 != 0;
  if (retval.1 != 0) goto <D.29006>; else goto <D.29007>;
  <D.29006>:
  D.25443 = 0;
  try
    {
      std::allocator<char>::allocator (&D.25437);
      try
        {
          try
            {
              std::__cxx11::basic_string<char>::basic_string (&name, "name", &D.25437);
              D.25443 = 1;
              __cxa_guard_release (&_ZGVZL4namevE4name);
              __cxa_atexit (__dt_comp , &name, &__dso_handle);
            }
          finally
            {
              std::allocator<char>::~allocator (&D.25437);
            }
        }
      finally
        {
          D.25437 = {CLOBBER};
        }
    }
  catch
    {
      if (D.25443 != 0) goto <D.29008>; else goto <D.29009>;
      <D.29008>:
      goto <D.29010>;
      <D.29009>:
      __cxa_guard_abort (&_ZGVZL4namevE4name);
      <D.29010>:
    }
  goto <D.29011>;
  <D.29007>:
  <D.29011>:
  goto <D.29012>;
  <D.29004>:
  <D.29012>:
  D.29013 = &name;
  return D.29013;
}