4 votes

I am studying the different pipeline stages of the MIPS R10000. The paper says the processor fetches 4 instructions from the instruction cache every cycle. But the latency of the instruction cache must be more than one cycle. I don't know its exact hit latency, but for comparison, the hit latency of the L1 data cache in a Haswell processor is about 4 cycles.

So if we assume the L1 instruction cache latency is 3-4 cycles, how can the processor fetch 4 instructions each cycle?

The R10000 operates at 200 MHz while Haswell can operate at more than 3 GHz, so 4 cycles on Haswell is a much smaller amount of time than 1 cycle on the R10000. It's typical for processors made in the 90s to have cache access latencies of 1 or 2 cycles. – Hadi Brais
Thanks. Doesn't the same problem occur in Haswell? – mathworker
The reference you linked only says that the L1 data cache has an access latency of 4 cycles. That does not necessarily mean that the L1 instruction cache has the same latency, although both of them are pipelined so that an access request can be performed each cycle. The instruction cache in Haswell has a throughput of 16 bytes per cycle. In processors that support speculative execution, such as Haswell and the MIPS R10000, the L1I latency matters only for the branch misprediction penalty... – Hadi Brais
...I don't know whether the R10000 uses a pipelined design for the L1I to achieve an effective 1-cycle latency (which is why I didn't post an answer), but it's possible that its latency is 2 cycles (like its data cache, I think) and that it's pipelined so that a request can be performed every cycle. – Hadi Brais
In general, the instruction cache must be able to supply at least one instruction every cycle (in terms of throughput, not necessarily latency). Otherwise, the clock frequency would be too high and would just waste power, which is a nonsensical design. This is basically the fetch stage of the pipeline, which is the first stage. In in-order pipelines or lower-power CPUs where the number of load buffers is small, it's critical for the L1D cache to have minimal latency. For example, Intel Atom processors all have an L1D latency of 3 cycles. – Hadi Brais

1 Answer

6 votes

The MIPS R10000 had a single-cycle-latency instruction cache and could fetch any contiguous block of four instructions within a cache block, without alignment constraints.

Mechanically, this probably meant that it used four SRAM banks with at least partially independent addressing (the cache set address decode could be shared).

[Diagram: 4-bank array of 16 words]

Since each bank is independently addressable, as can be seen in the diagram, any contiguous sequence of four words within the sixteen words can be accessed. Addressing rows [0, 0, 0, 0] gets words [0, 1, 2, 3] (words 0-3); rows [1, 0, 0, 0] gets words [4, 1, 2, 3] (words 1-4); rows [1, 1, 0, 0] gets words [4, 5, 2, 3] (words 2-5); ...; rows [3, 3, 3, 2] gets words [12, 13, 14, 11] (words 11-14); rows [3, 3, 3, 3] gets words [12, 13, 14, 15] (words 12-15).
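To make the addressing concrete, here is a small C sketch (my own illustration, not anything from the paper or the actual hardware) that computes each bank's row address for every possible four-word window:

```c
#include <stdio.h>

/* A minimal sketch (not the actual R10000 RTL) of the 4-bank, 16-word
 * array above: word w lives in bank (w % 4) at row (w / 4). For a fetch
 * starting at word `start`, each bank gets its own row address, so any
 * four contiguous words come out in a single access. */
int main(void) {
    for (int start = 0; start <= 12; start++) {
        int row[4], word[4];
        for (int bank = 0; bank < 4; bank++) {
            /* first word at or after `start` that maps to this bank */
            int w = start + (bank - start % 4 + 4) % 4;
            row[bank]  = w / 4;   /* per-bank row address */
            word[bank] = w;       /* word this bank supplies */
        }
        printf("start=%2d rows=[%d,%d,%d,%d] words=[%2d,%2d,%2d,%2d]\n",
               start, row[0], row[1], row[2], row[3],
               word[0], word[1], word[2], word[3]);
    }
    return 0;
}
```

Running it reproduces the row/word patterns listed above.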

(The same banking could cross cache block boundaries, but then two cache block hits would have to be confirmed in parallel. Memoization of the way for the previous access would reduce this to one set check for a common case of sequential accesses in largish cache blocks; one set would use the memoized way and the other would perform the normal check when entering a new cache block. Page crossing is a similar issue.)
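As a rough sketch of that way-memoization idea, assuming an invented 2-way cache with 32-byte blocks (nothing here is taken from the R10000):

```c
/* Hedged sketch of way memoization for sequential fetch; the structure
 * and names are invented for illustration (2-way set, 32-byte blocks),
 * not taken from the R10000. */
typedef struct { unsigned tag[2]; } Set;

static int      last_way   = -1;   /* way used by the previous hit    */
static unsigned last_block = ~0u;  /* block address of that hit       */

int lookup_way(const Set *set, unsigned addr) {
    unsigned block = addr / 32;    /* 32-byte cache blocks */
    if (block == last_block)
        return last_way;           /* same block: reuse memoized way,
                                      skipping the tag comparison      */
    for (int w = 0; w < 2; w++) {
        if (set->tag[w] == block) {/* normal check on entering a block */
            last_way   = w;
            last_block = block;
            return w;
        }
    }
    return -1;                     /* miss */
}
```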

(A common alternative for multiple instruction fetch does have an alignment constraint of a naturally aligned chunk of, e.g., 16 bytes.)
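For contrast, a small sketch of what that alignment constraint costs, assuming 4-byte instructions and a hypothetical helper function:

```c
/* Sketch of the aligned-chunk alternative: with a naturally aligned
 * 16-byte fetch group (four 4-byte MIPS instructions), a fetch that
 * starts mid-group yields fewer than four useful instructions. */
int useful_instructions(unsigned pc) {
    unsigned slot = (pc / 4) % 4;  /* position within the aligned group */
    return 4 - (int)slot;          /* instructions from pc to group end */
}
/* e.g., useful_instructions(0x1000) == 4, useful_instructions(0x100C) == 1 */
```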

This processor did not redirect instruction fetch until a branch was detected in the second pipeline stage (decode), so a taken branch introduced a one cycle bubble even with a correct prediction. An incorrect prediction might not be determined until some cycles later because execution started in the fourth pipeline stage and instructions were executed out-of-order. (An incorrectly predicted taken branch could decode instructions already fetched in the taken branch bubble as these were stored in a "resume cache".)
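A toy two-stage timeline of that bubble, with assumed addresses (a branch at PC 8 targeting 100):

```c
#include <stdio.h>

/* Toy fetch/decode timeline, assuming a taken branch at PC 8 with
 * target 100 that is only detected in decode; all numbers are invented.
 * The sequentially fetched slot after the branch is the one-cycle
 * bubble (which, per the text above, the R10000's "resume cache" could
 * still make use of on a misprediction). */
int main(void) {
    unsigned pc = 0, in_decode = ~0u;
    for (int cycle = 1; cycle <= 5; cycle++) {
        unsigned fetched  = pc;
        int      redirect = (in_decode == 8);  /* decode sees the branch */
        printf("cycle %d: fetch %3u%s\n", cycle, fetched,
               redirect ? "  <- bubble (squashed on redirect)" : "");
        in_decode = fetched;
        pc = redirect ? 100u : pc + 4;
    }
    return 0;
}
```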

Buffering of instructions can smooth out such bubbles, since sustained throughput rarely approaches the maximum anyway because of data dependencies and other hazards.
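A toy model of that smoothing, with assumed fetch and decode rates:

```c
#include <stdio.h>

/* Toy model of an instruction buffer between fetch and decode. Fetch
 * supplies 4 instructions per cycle but bubbles every 5th cycle; decode
 * demand alternates between 2 and 3 per cycle. All numbers are assumed.
 * The buffer hides the fetch bubbles because average fetch bandwidth
 * exceeds average decode demand. */
int main(void) {
    int buffer = 0, delivered = 0, starved = 0;
    for (int cycle = 1; cycle <= 20; cycle++) {
        if (cycle % 5 != 0 && buffer + 4 <= 8)  /* 8-entry buffer cap */
            buffer += 4;
        int demand = (cycle % 2) ? 2 : 3;
        int take   = demand < buffer ? demand : buffer;
        if (take < demand) starved++;
        delivered += take;
        buffer    -= take;
    }
    printf("delivered %d instructions; decode starved in %d cycles\n",
           delivered, starved);
    return 0;
}
```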

In general, a cache can provide multiple words per fetch (a natural alignment restriction facilitates a single bank providing the chunk) or be accessed multiple times per cycle (e.g., more deeply pipelining the instruction cache than other parts of the pipeline or using expensive multiported SRAM).

As long as a new address is provided every cycle, a fetch of multiple contiguous instructions can be done every cycle. If two addresses are available (predicted) per cycle, instructions after a taken branch could be fetched in the same cycle. (Another method of reducing the taken branch penalty — and providing other post-branch optimization opportunities — is to use a trace cache.)
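A sketch of what a second predicted fetch address per cycle would allow, with invented addresses and a hypothetical 4-wide fetch group:

```c
#include <stdio.h>

/* Hypothetical dual-address fetch: given a predicted-taken branch in
 * the current 4-wide fetch group, fill the remaining slots from the
 * predicted target in the same cycle. Addresses and slot numbers are
 * invented for illustration. */
int main(void) {
    unsigned pc = 0, target = 100;
    unsigned branch_slot = 1;        /* branch is the 2nd instruction */
    unsigned group[4], n = 0;
    for (unsigned i = 0; i <= branch_slot; i++)  /* up to the branch  */
        group[n++] = pc + 4 * i;
    for (unsigned i = 0; n < 4; i++)             /* then from target  */
        group[n++] = target + 4 * i;
    for (unsigned i = 0; i < 4; i++)
        printf("slot %u: addr %u\n", i, group[i]);
    return 0;
}
```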