4 votes

I understand that the load/store queue architecture facilitates store-to-load forwarding and the disambiguation of out-of-order speculative loads. This is accomplished by matching load and store addresses.

This matching-address technique will not work if the earlier store is to an unaligned address and the load depends on it. My question is: if this later load issued out of order, how does it get disambiguated against earlier stores? Or what policies do modern architectures use to handle this condition?

If your reads and writes are served within the cache (hits), would it matter if the addresses are unaligned? I think for the situation you are referring to, your caches would need to be write-no-allocate or write-through with invalidate, followed by the read. Even so, wouldn't the write buffer in the next level serve the read? – Isuru H
"This matching address technique will not work if the earlier store is to an unaligned address and the load depends on it": do you have an official reference for this? The store buffer is inside the core; it doesn't really care about unaligned addresses AFAIK. However, complications like crossing a cache line/page or a size mismatch are known to possibly prevent certain optimizations. Finally, load-store reordering is defined in the Intel manual, volume 3. Can you clarify what you mean by "disambiguated by earlier stores"? – Margaret Bloom
@IsuruH This question is about the load/store queue structure, which sits before the L1D cache. It is not about cache hits or misses, but about how loads get disambiguated after they have executed speculatively out of order. Many modern architectures support speculative execution of loads, as it is critical for performance. – user1669844
@lax Apologies for the late response and for my misunderstanding. If I understand correctly, you are asking about a situation where a speculative write is performed to an address that falls at the edge of a cache line (partly in that cache line and partly in the next), and there is a speculative load afterwards to one of the (speculatively) modified cache lines? – Isuru H
@lax I know you've said cache-line granularity does not apply in this case, but I cannot get my head around a valid situation that reflects the problem you are describing here. – Isuru H

2 Answers

4 votes

Short

The short answer is that it depends on the architecture, but in theory unaligned operations don't necessarily prevent the architecture from performing store forwarding. As a practical matter, however, the much larger number of forwarding possibilities that unaligned operations represent means that forwarding from such locations may not be supported at all, or may be less well supported than the aligned cases.

Long

The long answer is that any particular architecture will have various scenarios they can handle efficiently, and those they cannot.

Old or very simple architectures may not have any store-forwarding capabilities at all. These architectures may not execute out of order at all, or may have some out-of-order capability but may simply wait until all prior stores have committed before executing any load.

The next level of sophistication is an architecture that at least has some kind of CAM to check prior store addresses. This architecture may not have store forwarding, but may allow loads to execute in-order or out-of-order once the load address and all prior store addresses are known (and there is no match). If there is a match with a prior store, the architecture may wait until the store commits before executing the load (which will read the stored value from the L1, if any).
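The check such a CAM performs amounts to a byte-range overlap test between the load and each older, uncommitted store. A minimal Python sketch (the names `overlaps` and `load_may_execute` are mine, purely illustrative of this scheme, not any real hardware interface):

```python
def overlaps(a_addr, a_size, b_addr, b_size):
    # Byte ranges [addr, addr + size) overlap iff each starts
    # before the other ends.
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size

def load_may_execute(load_addr, load_size, prior_stores):
    """Return True if the load matches no prior store and may therefore
    read its value from the L1 cache.

    prior_stores: (addr, size) pairs for all older uncommitted stores.
    In this (non-speculative) scheme, every one of those addresses must
    already be known before the load is allowed to execute.
    """
    return not any(overlaps(load_addr, load_size, s_addr, s_size)
                   for s_addr, s_size in prior_stores)
```

If `load_may_execute` returns False, such an architecture would hold the load until the matching store commits, then read the value from the L1.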

Next up, we have architectures like the above that wait until prior store addresses are known and also do store forwarding. The behavior is the same as above, except that when a load address hits a prior store, the store data is forwarded to the load without waiting for it to commit to L1.

A big problem with the above designs is that loads still can't execute until all prior store addresses are known, which inhibits out-of-order execution. So next we add speculation: if a load at a particular IP has been observed not to depend on prior stores, we just let it execute (read its value) even if prior store addresses aren't known. At retirement there is a second check to ensure that the assumption that the load hit no prior store was correct, and if not, there is some type of pipeline flush and recovery. Loads that are predicted to hit a prior store wait until the store data (and possibly address) is available, since they'll need store-forwarding.1
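The speculation step can be modeled with a toy predictor (a deliberately simplified Python sketch of the idea, not any real design such as store sets; the class and method names are made up):

```python
class DependencePredictor:
    """Toy memory-dependence predictor.

    Every load PC starts out predicted independent of prior stores and is
    allowed to execute early. After one observed violation, that PC is
    predicted dependent and must wait for prior store addresses/data.
    """

    def __init__(self):
        self.predicted_dependent = set()

    def may_speculate(self, load_pc):
        # True -> load may read its value before prior store
        # addresses are known.
        return load_pc not in self.predicted_dependent

    def report_violation(self, load_pc):
        # Called by the retirement-time check when the load turned out
        # to hit a prior store; real hardware flushes the pipeline here.
        self.predicted_dependent.add(load_pc)
```

Real predictors are more elaborate (they may re-enable speculation after a while, or track which store a load depends on), but this captures the predict/verify/recover loop described above.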

That's kind of where we are at today. There are yet more advanced techniques, many of which fall under the banner of memory renaming, but as far as I know they are not widely deployed.

Finally, we get to your original question: how all of this interacts with unaligned loads. Most of the above doesn't change; we only need to be more precise about what counts as a hit, i.e., the case above where a load reads data from a previous store.

You have several scenarios:

  1. A later load is totally contained within a prior store. This means that all the bytes read by a load come from the earlier store.
  2. A later load is partially contained within a prior store. This means that one or more bytes of the load come from an earlier store, but one or more bytes do not.
  3. A later load is not contained at all within any earlier store.
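The three scenarios come down to comparing the byte ranges the two accesses touch. A toy Python model (`classify` is a hypothetical helper, not real hardware):

```python
def classify(load_addr, load_size, store_addr, store_size):
    """Classify a later load against an earlier store by byte overlap."""
    load_bytes = set(range(load_addr, load_addr + load_size))
    store_bytes = set(range(store_addr, store_addr + store_size))
    if not (load_bytes & store_bytes):
        return "no overlap"          # scenario 3
    if load_bytes <= store_bytes:
        return "fully contained"     # scenario 1: forwarding possible
    return "partially contained"     # scenario 2

# A 4-byte load at 0x100 against an 8-byte store at 0x100 is scenario 1;
# move the load to 0x106 and only two of its bytes come from the store
# (scenario 2); move it to 0x200 and there is no overlap (scenario 3).
```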

On most platforms, all three scenarios exist regardless of alignment. However, in the case of aligned values, the second case (partial overlap) can only occur when a larger load follows a smaller store, and if the platform supports only one size of load, scenario (2) cannot occur at all.

Theoretically, direct2 store-to-load forwarding is possible in scenario (1), but not in scenarios (2) or (3).

To catch many practical cases of (1), you only need to check that the store and load addresses are the same and that the load is not larger than the store. This still misses cases where a smaller load is fully contained in a larger store but starts at a different address, whether aligned or not.
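The gap between the cheap check and full containment can be sketched as follows (Python; both function names are mine, purely illustrative):

```python
def simple_hit(load_addr, load_size, store_addr, store_size):
    # Cheap check: identical start address, load no wider than the store.
    return load_addr == store_addr and load_size <= store_size

def fully_contained(load_addr, load_size, store_addr, store_size):
    # Full scenario-(1) check: every byte of the load lies inside
    # the byte range written by the store.
    return (store_addr <= load_addr and
            load_addr + load_size <= store_addr + store_size)

# A 4-byte load at store_addr + 4 inside an 8-byte store is fully
# contained, so forwarding is theoretically possible, but the cheap
# same-address check misses it.
```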

Where alignment helps is that the checks above are easier: you need to compare fewer bits of the addresses (e.g., a 32-bit load can ignore the bottom two bits of the address), and there are fewer possibilities to compare: a 4-byte load can only be contained in an 8-byte store in two possible ways (at the store address or at the store address + 4), while misaligned operations can be fully contained in five different ways (with the load address offset by 0, 1, 2, 3, or 4 bytes from the store address).

These differences are important in hardware, where the store queue has to look something like a fully-associative CAM implementing these comparisons. The more general the comparison, the more hardware is needed (or the longer the latency to do a lookup). Early hardware may have only caught the "same address" cases of (1), but the trend is towards catching more cases, both aligned and unaligned. Here is a great overview.


1 How best to do this type of memory-dependence speculation is something on which WARF holds patents, and on the basis of which it has been actively suing all sorts of CPU manufacturers.

2 By direct I mean from a single store to a following load. In principle, you might also have more complex forms of store forwarding that can take parts of multiple prior stores and forward them to a single load, but it isn't clear to me whether current architectures implement this.

1 vote

For ISAs like x86, loads and stores can be multiple bytes wide and unaligned. A store may write one or more bytes of which a load may read only a subset; conversely, the entire data written by a store can be a subset of the data read by a following load.

Now, to detect ordering violations (stores checking younger executed loads), this kind of overlap must be detected. I believe this can be accomplished even when unaligned accesses are involved, if the LSQ records both the address and the size (in bytes) of each access (along with other information such as its value, PC, etc.). With this information, the LSQ can compute every address that each load reads from or store writes to (address <= "addresses involved in the instruction" < address + size) and determine whether the accesses overlap.
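That calculation can be sketched directly (Python; the helper names are made up for illustration, following the address/size bookkeeping described above):

```python
def touched_bytes(addr, size):
    # Every byte address the access covers: addr <= x < addr + size.
    return set(range(addr, addr + size))

def accesses_match(load_addr, load_size, store_addr, store_size):
    """True if the load reads any byte the store writes, regardless
    of alignment or whether either access crosses a boundary."""
    return bool(touched_bytes(load_addr, load_size) &
                touched_bytes(store_addr, store_size))

# An unaligned 4-byte store at 0x0FFE spills two bytes past 0x1000,
# so a 2-byte load at 0x1000 matches it even though their start
# addresses differ.
```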

To add a little more on this topic: in reality, store-to-load forwarding (a store's data being supplied to a younger load before the store commits) might be restricted to certain cases, such as perfectly aligned accesses (I'm sorry, I'm not aware of the details implemented in modern processors). A forwarding mechanism that correctly handles most cases (and thus improves performance accordingly) comes at the cost of increased complexity. Remember that for unsupported forwarding cases you can always stall the load until the matching store(s) write back to the cache, after which the load can read its value directly from the cache.