We know that dirty victim data is not written back to RAM immediately; it is stashed in the store buffer and written back later as time permits. There is also the store forwarding technique: if you do a subsequent LOAD to the same location on the same core before the value is flushed to the cache/memory, the value in the store buffer is "forwarded" and you get the value that was just stored. This can be done in parallel with the cache access, so it doesn't slow things down.
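To make the forwarding behaviour described above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not how any real core is built: the class `StoreBufferModel`, its method names, and the dictionary standing in for the cache/RAM are all hypothetical, and timing, buffer capacity, and partial-width overlaps are ignored.

```python
class StoreBufferModel:
    """Toy model of one core's store buffer in front of a cache/memory.

    Hypothetical illustration only: no timing, no capacity limit.
    """

    def __init__(self, memory):
        self.memory = memory   # dict: address -> value (stands in for cache/RAM)
        self.pending = []      # FIFO of (address, value) stores not yet drained

    def store(self, addr, value):
        # The store goes into the buffer; memory is not touched yet.
        self.pending.append((addr, value))

    def load(self, addr):
        # Search youngest-to-oldest so the most recent store to addr wins.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value, "forwarded"
        # No pending store matches, so the load is served by memory.
        return self.memory.get(addr, 0), "memory"

    def drain_one(self):
        # Commit the oldest pending store to memory, preserving store order.
        if self.pending:
            addr, value = self.pending.pop(0)
            self.memory[addr] = value


memory = {0x40: 1}
core = StoreBufferModel(memory)

core.store(0x40, 7)
before_drain = core.load(0x40)   # served from the store buffer
memory_before = memory[0x40]     # memory still holds the old value

core.drain_one()
after_drain = core.load(0x40)    # the buffer is empty, so memory answers
```

In this model the load sees the new value both before and after the drain; the only difference is where it comes from.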
My question is: with the help of the store buffer and store forwarding, a store miss doesn't necessarily require the processor (the corresponding core) to stall. Therefore, store misses do not contribute to the total cache miss latency, right?
Thanks.