Problem Set Number 7
Computer Science 471

Due by 5 PM, Monday Nov. 20, 2000

1. Associativity usually improves the miss ratio, but not always. Give a short string of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size. State the size of your caches.
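If you want to sanity-check a candidate reference string, a small simulator is one way to do it. The sketch below (in Python; the class name, the parameters, and the placeholder reference string are purely illustrative and are not part of the assignment) models a set-associative cache with LRU replacement; a direct-mapped cache is just the 1-way case.

    # A minimal, optional sketch of a set-associative cache with LRU
    # replacement, usable to check a candidate reference string by hand.
    class Cache:
        def __init__(self, num_sets, ways, block_bytes):
            self.num_sets = num_sets
            self.ways = ways
            self.block_bytes = block_bytes
            # Each set holds its resident block tags, ordered LRU -> MRU.
            self.sets = [[] for _ in range(num_sets)]

        def access(self, addr):
            """Reference one byte address; return True on a hit, False on a miss."""
            block = addr // self.block_bytes
            index = block % self.num_sets
            tag = block // self.num_sets
            lines = self.sets[index]
            if tag in lines:
                lines.remove(tag)        # hit: move the tag to the MRU position
                lines.append(tag)
                return True
            if len(lines) == self.ways:  # miss: evict the LRU tag if the set is full
                lines.pop(0)
            lines.append(tag)
            return False

    # Example: count misses for the same (placeholder) reference string on a
    # direct-mapped cache and on a 2-way cache of the same total size.
    refs = [0x00, 0x40, 0x80]                            # placeholder addresses -- use your own
    direct = Cache(num_sets=4, ways=1, block_bytes=16)   # 64-byte direct-mapped
    twoway = Cache(num_sets=2, ways=2, block_bytes=16)   # 64-byte 2-way set-associative
    print(sum(not direct.access(a) for a in refs),       # misses in the direct-mapped cache
          sum(not twoway.access(a) for a in refs))       # misses in the 2-way cache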

2. Exercise 7.10 from the text.

3. This question asks you to figure out the numbers and sizes of a few things associated with a cache. Suppose the computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way associative. B is of course a power of 2, so let B = 2^b. Figure out what the following quantities are in terms of S, B, A, b, and k:

4. This question concerns caches of unusual sizes. You may not introduce any extra logic, not even a mux, in the address path for the cache.

5. (counts as two questions) This question asks you to consider Data Cache design alternatives for a pipelined MIPS implementation. This design has a 5 nanosecond cycle time (a.k.a. 200 MHz), and a separate Instruction Cache, whose design is not at issue. The current design calls for a direct-mapped data cache of 16 Kbytes, and the question is whether a 4-way set-associative cache of the same size might do better.

Assume that loads and stores that miss in the cache face the same penalty: a 20-cycle delay to reference main memory. (In other words, a miss takes 20 cycles longer than a hit.) Assume that 40 percent of all MIPS instructions executed are loads and stores (not 40 percent each--40 percent altogether). Assume that the miss ratios for the candidate data caches are 8.67 percent for the direct-mapped cache and 7.1 percent for the 4-way set-associative one (these are actual measured values from some real benchmarks). Assume that loads and stores each have a single delay slot: the very next instruction must be independent of the load or store, and if no such instruction can be found, the programmer or compiler must insert a no-op (see pp. 446-448 of the text).

As we have seen, a set-associative cache can threaten cycle time, and in this case it does. There are two possibilities: stretch the cycle time to 5.5 nanoseconds, or take two cycles, instead of one, for a cache hit. In the latter case, assume that the new extra delay slot after loads and stores can be filled by the compiler with a useful instruction 65 percent of the time. Assume that the base CPI is 1.5, which includes all I-cache effects and the hit time of the original data cache, but not any cycles due to D-cache misses. Assume that no-ops used in delay slots also cost 1.5 CPI on average.
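If it helps to organize the arithmetic, here is a minimal sketch (in Python, with illustrative names) of the usual average-time-per-instruction bookkeeping. It deliberately omits any term for extra no-ops or for a lengthened hit latency; deciding what each argument should be for each candidate design, and whether and how to add such extra terms, is the substance of the question.

    # One possible way to lay out the bookkeeping (illustrative only).
    def avg_time_per_instruction(cpi_without_dcache_misses,
                                 mem_refs_per_instruction,
                                 dcache_miss_rate,
                                 miss_penalty_cycles,
                                 cycle_time_ns):
        """Average execution time per instruction, in nanoseconds."""
        cpi = (cpi_without_dcache_misses
               + mem_refs_per_instruction * dcache_miss_rate * miss_penalty_cycles)
        return cpi * cycle_time_ns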

From the point of view of performance, should the cache be changed to 4-way set-associative, and if so, should the cycle time be lengthened or should the hit latency be lengthened? Show all calculations necessary to support your answer. Don't forget to re-calculate the 20-cycle main memory latency (that's 100 nanoseconds, right?) for the longer cycle time: fewer of the longer cycles will make the necessary 100 ns. The number of cycles used to access memory must of course be an integer!
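For instance, if the main-memory latency stays at 100 ns and the cycle time is stretched to 5.5 ns, then 100 / 5.5 is about 18.2, which must be rounded up to 19 whole cycles.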

6. (counts as two questions) How good, and how bad, can a cache be? Assume your MIPS computer has a single 1 Kbyte cache that holds both instructions and data. (That K is still binary: 1024 bytes.) As usual, the size of the cache--1 Kbyte--doesn't include the bits used for the tags. This cache is 2-way set-associative with 16-byte blocks and strict LRU replacement within a set. There is an array of 4096 32-bit integers stored in consecutive memory locations, starting at address 0. Please write two MIPS assembly-language programs to compute the sum of these integers and leave it in register 3: first, write a program that will achieve the best possible cache hit ratio; then write another that will achieve the worst possible cache hit ratio.

Both programs should read each element of the array only once; neither should do any other D-stream references; neither should use no-ops; and neither should branch needlessly. Calculate and clearly state the hit ratios in both cases. Do this by combining all instruction and data misses and references--don't report I-stream and D-stream hit ratios separately. Don't forget to specify the starting address of your programs. Assume a non-pipelined processor, so that there are no delayed branches, load-delay slots, flushes, or other pipeline stuff. Assume the cache is initially empty.
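If you want to double-check your hit-ratio arithmetic, the simulator sketched after question 1 can be reused here (again purely optional; the empty trace below is a placeholder, not your program). Instantiate it with this cache's geometry and feed it the merged instruction-and-data byte-address trace your program would generate.

    # Optional check, reusing the Cache sketch from question 1.  The geometry
    # follows from the problem statement: 1024 bytes / (2 ways * 16-byte blocks)
    # = 32 sets.  The trace is a placeholder -- it should contain one byte
    # address per instruction fetch or data reference, in program order.
    cache = Cache(num_sets=32, ways=2, block_bytes=16)
    trace = []                                    # placeholder merged I- and D-stream trace
    hits = sum(cache.access(a) for a in trace)
    if trace:
        print("combined hit ratio:", hits / len(trace))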

Please include helpful comments or explanations along with your programs--don't just hand in uncommented assembly language.