
Background

I'm using an open source off-heap cache implementation, ohc.

Recently, I found a pull request on GitHub which suggests to

replace spin-on-compare-and-set with spin-on-read.

Here is the change. It adds only one line, while (lockFieldUpdater.get(this) != 0L), so the loop becomes:

    while (true)
    {
        if (lockFieldUpdater.compareAndSet(this, 0L, t))
            return true;

        // the added line: spin on a plain read until the lock looks free
        while (lockFieldUpdater.get(this) != 0L)
            Thread.yield();
    }
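
For context, here is a simplified, self-contained sketch of the surrounding locking pattern. It is not ohc's exact class; for illustration I assume the lock word is a volatile long managed through an AtomicLongFieldUpdater, and that t is a non-zero token identifying the locking thread:

    import java.util.concurrent.atomic.AtomicLongFieldUpdater;

    // Simplified sketch of the locking pattern above -- not ohc's actual class.
    // Assumes a volatile long lock word and a non-zero thread token t.
    class SpinLockSketch
    {
        private volatile long lock;

        private static final AtomicLongFieldUpdater<SpinLockSketch> lockFieldUpdater =
                AtomicLongFieldUpdater.newUpdater(SpinLockSketch.class, "lock");

        boolean lock(long t)
        {
            while (true)
            {
                if (lockFieldUpdater.compareAndSet(this, 0L, t))
                    return true;

                // spin-on-read: wait on the (cached) lock word until it looks free,
                // then go back and retry the CAS
                while (lockFieldUpdater.get(this) != 0L)
                    Thread.yield();
            }
        }

        void unlock()
        {
            lockFieldUpdater.set(this, 0L);
        }
    }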

Benchmark performance

I compiled it and ran the benchmark tool against it:

[benchmark results image]

Online performance

Then I deployed it to production. The original average read latency was about 35,000 nanoseconds, and it is only about 10,000 nanoseconds with the new version.

[production latency graph]

Question

What is the difference between these two implementations? Why is spin-on-read so much faster in this case?

My answer is generalized, but if you would like a more specific answer, it would be good to see your benchmarks, the number of threads, what kind of hardware you tested on, what hardware is in production, etc. – Eric
@Eric Thank you very much. I used the default benchmark parameters provided by ohc, and I tested it on my MacBook. The production machine has 72 cores and about 20 threads working at the same time. – xingbin
What are the specs on your MacBook? I'm assuming the production server is some NUMA-based architecture (maybe Intel Xeon?) – Eric
@Eric Yeah, it's a NUMA-based architecture. – xingbin
Updated the answer with some speculation.

1 Answer


In order to understand why the performance improves, it is good to know a little bit about cache coherency protocols. The original version relies solely on a heavy-handed read-modify-write operation and yields if it fails. It's heavy-handed because the CAS operation generates quite a bit of cache coherence traffic: obtaining ownership of the cache line, invalidating the copies on the other cores, and so on. This naive approach leads to a whole lot of contention as the number of threads increases.

The modified version is an improvement over the naive approach because it synchronizes the threads' actions a bit better. Each thread spins on its own cached copy of the lock word, and only once that local copy has been invalidated (because it was modified on another core) is the thread allowed to attempt a CAS again.

This is very similar to why test-and-test-and-set (TATAS) locks are an improvement over naive test-and-set (TAS) locks.
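
To make that concrete, here is a minimal illustrative sketch of the two flavours. It is not ohc's code; it uses an AtomicBoolean flag instead of ohc's long lock word, but the spin structure is the same:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative TAS vs. TATAS sketch (not ohc code); uses an AtomicBoolean
    // flag instead of ohc's long lock word.
    class TasVsTatas
    {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        // naive TAS: every failed attempt is a read-modify-write that generates
        // cache coherence traffic
        void lockTas()
        {
            while (!locked.compareAndSet(false, true))
                Thread.yield();
        }

        // TATAS: after a failed CAS, spin on a plain read; this hits the core's
        // own cached copy until the owner releases the lock and invalidates it
        void lockTatas()
        {
            while (true)
            {
                if (locked.compareAndSet(false, true))
                    return;

                while (locked.get())
                    Thread.yield();
            }
        }

        void unlock()
        {
            locked.set(false);
        }
    }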

As to why your local benchmarks show a ~6% speedup while your production server sees a ~3.5x speedup, there are probably a few reasons:

  1. Your production server benefits a lot from spinning on a locally cached copy, since memory accesses across NUMA nodes carry a severe performance penalty.
  2. The performance of both the TAS lock and the TATAS lock degrades as the number of threads contending for the lock increases, but the TATAS lock degrades more slowly than the TAS lock. The blog entry Test-and-set spinlocks has a nice chart illustrating this. Maybe your local benchmark simply runs with too few threads to show a significant improvement? A rough way to check is sketched below.
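
To see the degradation trend yourself, one crude approach is to have a varying number of threads hammer each lock flavour and time it. This is only a sketch under my assumptions (plain threads, arbitrary thread and iteration counts, an AtomicBoolean lock word rather than ohc's); a proper harness such as JMH would give far more trustworthy numbers:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Crude contention microbenchmark sketch (not the ohc benchmark tool):
    // each run has N threads repeatedly acquiring and releasing one lock,
    // either spinning on CAS only or spinning on a read first.
    public class SpinBench
    {
        private static final AtomicBoolean LOCKED = new AtomicBoolean(false);

        private static void lock(boolean spinOnRead)
        {
            while (!LOCKED.compareAndSet(false, true))
            {
                if (spinOnRead)
                {
                    // TATAS-style: spin on a plain read before retrying the CAS
                    while (LOCKED.get())
                        Thread.yield();
                }
                else
                {
                    // naive TAS-style: just yield and retry the CAS
                    Thread.yield();
                }
            }
        }

        private static long run(int threads, boolean spinOnRead) throws InterruptedException
        {
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++)
            {
                workers[i] = new Thread(() -> {
                    for (int n = 0; n < 100_000; n++)
                    {
                        lock(spinOnRead);
                        LOCKED.set(false); // unlock
                    }
                });
                workers[i].start();
            }
            for (Thread w : workers)
                w.join();
            return System.nanoTime() - start;
        }

        public static void main(String[] args) throws InterruptedException
        {
            for (int threads : new int[] { 1, 4, 16, 32 })
            {
                System.out.printf("threads=%2d  cas-spin=%dms  read-spin=%dms%n",
                                  threads,
                                  run(threads, false) / 1_000_000,
                                  run(threads, true) / 1_000_000);
            }
        }
    }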