### Problem Set Number 7 Computer Science 471 Due by 5 PM, Monday Nov. 18, 1996

1. (5 points) Associativity usually improves the miss ratio, but not always. Give a short string of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.

2. (5 points) Exercise 7.12 from the text.

3. (5 points) This question asks you to figure out the numbers and sizes of a few things associated with a cache. Suppose the computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way associative. B is of course a power of 2, so let B=2b. Figure out what the following quantities are in terms of S, B, A, b, and k:

• the number of sets in the cache;
• the number of index bits in the address;
• the number of 2-input exclusive-OR gates needed to perform the address match; and
• the number of bits in the TAG memory (the part of the cache that holds the tag bits)--not the width of the tag memory, the total number of bits contained in the whole thing.

4. (6 points)This question concerns caches of unusual sizes.

• Can you make a fully-associative cache of 3K words? (Remember that a K is 210.) Can you make a set-associative cache of 3K words? How about a direct-mapped cache of 3K words? For each of these: how, or why not?
• Can you make a fully-associative, set-associative, or direct-mapped cache of 300 words? (Remember that 300 is 3 * 102.) For each of these: how, or why not?

5. (15 points) This question asks you to consider Data Cache design alternatives for a pipelined MIPS implementation. This design has a 10 nanosecond cycle time, and a separate Instruction Cache, whose design is not at issue. The current design calls for a direct-mapped data cache of 16 Kbytes, and the question is whether a 4-way set-associative cache of the same size might do better.

Assume that loads and stores that miss in the cache face the same penalty: a 20-cycle delay to reference main memory. Assume that 40 percent of all MIPS instructions executed are loads and stores (not 40 percent each--40 percent all together). Assume that the miss ratios for the candidate data caches are 8.67 percent for the direct-mapped cache and 7.1 percent for the 4-way set-associative one (these are actual measured values from some real benchmarks).

As we have seen, a set-associative cache can threaten cycle time, and in this case it does. There are two possibilities: stretch the cycle time to 11 nanoseconds, or take two cycles, instead of one, for a cache hit. In the latter case, assume that the new extra delay slot after loads and stores can be filled by the compiler with a useful instruction 65 percent of the time. Assume that the base CPI is 1.5, which includes all I-cache effects, the hit time of the original data cache, but not any cycles due to D-cache misses. Assume that no-ops used in delay slots also cost 1.5 CPI on average.

From the point of view of performance, should the cache be changed to 4-way set-associative, and if so, should the cycle time be lengthened or should the hit latency be lengthened? Show all calculations necessary to support your answer. Don't forget to re-calculate the 20-cycle main memory latency (that's 200 nanoseconds, right?) for the longer cycle time: fewer of the longer cycles will make the necessary 200 ns. The number of cycles used to access memory must of course be an integer!

6. (15 points) How good, and how bad, can a cache be? Assume your MIPS computer has a single 1 Kbyte cache that holds both instructions and data. This cache is 2-way set-associative with 16-byte blocks and strict LRU replacement within a set. There is an array of 4096 32-bit integers stored in consecutive memory locations, starting at address 0. Please write two programs to compute the sum of these integers and leave it in register 3: first, write a program that will achieve the best possible cache hit ratio; then write another that will achieve the worst possible cache hit ratio.

Both programs should read each element of the array only once; neither should do any other D-stream references; neither should use no-ops; and neither should branch needlessly. Calculate and clearly state the hit ratios in both cases, and don't bother to treat the I-stream and D-stream separately. Don't forget to specify the starting address of your program. Assume a non-pipelined processor, so that there are no delayed branches, load-delay slots, flushes, or other pipeline stuff. Assume the cache is initially empty

7. (5 points) Time to tell the fellas what you thought of their chapter 6. Please do the usual to http://www.cs.princeton.edu/courses/cs471/chap6.txt. To get the 5 points for this question, just write down the time you sent your message to the publisher.

8. (1 point) How long did this take you, not counting the reading, and with whom did you work?