My guess; I'm not familiar with the details of any real-world designs
Even if the store buffer is a full scheduler that can "grab" any graduated store for commit to L1d, I'd assume it would use an oldest-ready first order. (Like an instruction / uop scheduler aka RS Reservation Station.)
"Ready" would mean the cache line is exclusively owned (Modified or Exclusive state). Every graduated store itself is implicitly ready to commit because by definition the associated store instruction has retired.
In-order retirement means that stores become eligible for commit in program-order, so you can't have an older store that's temporarily hidden from the oldest-ready-first scheduling.
Together, those things would ensure that for any given byte, stores overlapping it are in program order and thus maintain consistency of global-visibility order and final value for any given group of bytes within a cache line.
A memory barrier might work by fencing off the store buffer like a divider on a grocery-store checkout conveyor belt, preventing grabbing of stores past it while committing ones to the same place in the same line.
We do know real-world weakly-ordered store buffers like PowerPC RS64-III (in-order exec) and Alpha 21264 (OoO exec) do merging to help them create whole 4-byte or 8-byte aligned commits to L1d, e.g. out of multiple byte stores. That's also fine, assuming your merge algorithm respects order for any given byte e.g. by putting the data from a younger store into an older SB entry or vice versa and marking the other entry as "already committed". Obviously this must respect store barriers.
I think this is all fine even with unaligned stores, although preserving atomicity guarantees for unaligned stores could be tricky with merging. (Intel P6-family and later does provide atomicity guarantees for unaligned cached stores that don't cross a cache-line boundary, but we don't think Intel does merging in the store buffer proper; maybe just some stuff with LFBs for cache-miss back-to-back stores to the same line.)
It's likely that real hardware might not be a full scheduler that can merge any 2 SB entries, e.g. maybe only over limited range to reduce the amount of different addresses (and sizes) to compare at once. Also, you'd probably still only free up SB entries in program order, so it can basically be a circular buffer (unlike the RS). Alloc in program order, and having the order be tracked by the layout of the SB itself, makes it much cheaper for memory barriers to work, and to track where the youngest "graduated" store is.
Disclaimer: IDK if this is exactly how real HW works
Possible corner case: unaligned 4-byte store to [cache_line+63]
(split across a CL boundary) and then to [cache_line+60]
(fully contained in the lower cache line). If the older store-buffer entry can't commit right away because we don't yet own the next cache line, but we do own cache_line
, we still can't let the younger store to cache_line+60
commit first, if we're depending on that not happening to avoid WAW hazards.
So you'd probably want a line-split SB entry to be able to commit the data to one line but not the other, allowing oldest-ready-first to happen for each location separately, not tying together order across 2 cache lines.
Related: I wrote my own answer explaining what a store buffer is. I tried to avoid mistakes like Wikipedia makes ("when a store retires, it writes its value to the memory system": In fact retirement just makes it eligible to commit; such stores are called "graduated" stores.)