17
votes

There is something that bugs me with the Java memory model (if i even understand everything correctly). If there are two threads A and B, there are no guarantees that B will ever see a value written by A, unless both A and B synchronize on the same monitor.

For any system architecture that guarantees cache coherency between threads, there is no problem. But if the architecture does not support cache coherency in hardware, this essentially means that whenever a thread enters a monitor, all memory changes made before must be commited to main memory, and the cache must be invalidated. And it needs to be the entire data cache, not just a few lines, since the monitor has no information which variables in memory it guards. But that would surely impact performance of any application that needs to synchronize frequently (especially things like job queues with short running jobs). So can Java work reasonably well on architectures without hardware cache-coherency? If not, why doesn't the memory model make stronger guarantees about visibility? Wouldn't it be more efficient if the language would require information what is guarded by a monitor?

As i see it the memory model gives us the worst of both worlds, the absolute need to synchronize, even if cache coherency is guaranteed in hardware, and on the other hand bad performance on incoherent architectures (full cache flushes). So shouldn't it be more strict (require information what is guarded by a monitor) or more lose and restrict potential platforms to cache-coherent architectures?

As it is now, it doesn't make too much sense to me. Can somebody clear up why this specific memory model was choosen?


EDIT: My use of strict and lose was a bad choice in retrospect. I used "strict" for the case where less guarantees are made and "lose" for the opposite. To avoid confusion, its probably better to speak in terms of stronger or weaker guarantees.

4
One thing that needs to be commented on is the statement "synchronize on the same monitor". As I understand it synchronization on the same monitor guarantees execution order around that monitor. However, in terms of memory cache, acquiring any synchronized block forces a cache reload and releasing any lock forces a flush of all writes from working CPU memory.Gray
Which specific arch did you had in mind?curiousguy

4 Answers

6
votes

the absolute need to synchronize, even if cache coherency is guaranteed in hardware

Yes, but then you only have to reason against the Java Memory Model, not against a particular hardware architecture that your program happens to run on. Plus, it's not only about the hardware, the compiler and JIT themselves might reorder the instructions causing visibility issue. Synchronization constructs in Java addresses visibility & atomicity consistently at all possible levels of code transformation (e.g. compiler/JIT/CPU/cache).

and on the other hand bad performance on incoherent architectures (full cache flushes)

Maybe I misunderstood s/t, but with incoherent architectures, you have to synchronize critical sections anyway. Otherwise, you'll run into all sort of race conditions due to the reordering. I don't see why the Java Memory Model makes the matter any worse.

shouldn't it be more strict (require information what is guarded by a monitor)

I don't think it's possible to tell the CPU to flush any particular part of the cache at all. The best the compiler can do is emitting memory fences and let the CPU decides which parts of the cache need flushing - it's still more coarse-grained than what you're looking for I suppose. Even if more fine-grained control is possible, I think it would make concurrent programming even more difficult (it's difficult enough already).

AFAIK, the Java 5 MM (just like the .NET CLR MM) is more "strict" than memory models of common architectures like x86 and IA64. Therefore, it makes the reasoning about it relatively simpler. Yet, it obviously shouldn't offer s/t closer to sequential consistency because that would hurt performance significantly as fewer compiler/JIT/CPU/cache optimizations could be applied.

5
votes

Existing architectures guarantee cache coherency, but they do not guarantee sequential consistency - the two things are different. Since seq. consistency is not guaranteed, some reorderings are allowed by the hardware and you need critical sections to limit them. Critical sections make sure that what one thread writes becomes visible to another (i.e., they prevent data races), and they also prevent the classical race conditions (if two threads increment the same variable, you need that for each thread the read of the current value and the write of the new value are indivisible).

Moreover, the execution model isn't as expensive as you describe. On most existing architectures, which are cache-coherent but not sequentially consistent, when you release a lock you must flush pending writes to memory, and when you acquire one you might need to do something to make sure future reads will not read stale values - mostly that means just preventing that reads are moved too early, since the cache is kept coherent; but reads must still not be moved.

Finally, you seem to think that Java's Memory Model (JMM) is peculiar, while the foundations are nowadays fairly state-of-the-art, and similar to Ada, POSIX locks (depending on the interpretation of the standard), and the C/C++ memory model. You might want to read the JSR-133 cookbook which explains how the JMM is implemented on existing architectures: http://g.oswego.edu/dl/jmm/cookbook.html.

4
votes

The answer would be that most multiprocessors are cache-coherent, including big NUMA systems, which almost? always are ccNUMA.

I think you are somewhat confused as to how cache coherency is acomplished in practice. First, caches may be coherent/incoherent with respect to several other things on the system:

  • Devices
  • (Memory modified by) DMA
  • Data caches vs instruction caches
  • Caches on other cores/processors (the one this question is about)
  • ...

Something has to be made to maintain coherency. When working with devices and DMA, on architectures with incoherent caches with respect to DMA/devices, you would either bypass the cache (and possibly the write buffer), or invalidate/flush the cache around operations involving DMA/devices.

Similarly, when dynamically generating code, you may need to flush the instruction cache.

When it comes to CPU caches, coherency is achieved using some coherency protocol, such as MESI, MOESI, ... These protocols define messages to be sent between caches in response to certain events (e.g: invalidate-requests to other caches when a non-exclusive cacheline is modified, ...).

While this is sufficient to maintain (eventual) coherency, it doesn't guarantee ordering, or that changes are immediately visible to other CPUs. Then, there are also write buffers, which delay writes.

So, each CPU architecture provides ordering guarantees (e.g. accesses before an aligned store cannot be reordered after the store) and/or provide instructions (memory barriers/fences) to request such guarantees. In the end, entering/exiting a monitor doesn't entail flushing the cache, but may entail draining the write buffer, and/or stall waiting for reads to end.

1
votes

the caches that JVM has access to are really just CPU registers. since there aren't many of them, flushing them upon monitor exit isn't a big deal.

EDIT: (in general) the memory caches are not under the control of JVM, JVM cannot choose to read/write/flush these caches, so forget about them in this discussion

imagine each CPU has 1,000,000 registers. JVM happily exploits them to do crazy fast computations - until it bumps into monitor enter/exit, and has to flush 1,000,000 registers to the next cache layer.

if we live in that world, either Java must be smart enough to analyze what objects aren't shared (majority of objects aren't), or it must ask programmers to do that.

java memory model is a simplified programming model that allows average programmers make OK multithreading algorithms. by 'simplified' I mean there might be 12 people in the entire world who really read chapter 17 of JLS and actually understood it.