Problem Set Number 9
Computer Science 471

Due by 5 PM, Wednesday (not Monday) Dec. 6, 2000

Questions 1 and 2 both ask you to write programs using only the 11 instructions from the MIPS IV architecture that are described in the R10000 handout, section 3 (beq, beql, dadd, daddi, ld, movn, movz, slt, slti, pref, sd). Question 1 further restricts this subset. Remember that registers are now 64 bits wide, as are virtual addresses. Here is the initial situation for both questions:

Both questions ask you to write a program that computes the sum of all elements of V that are less than C, leaving the result in register $11. Both programs will be graded on correctness and efficiency. Don't be distracted by thinking about out-of-order execution; assume your programs are executed one instruction at a time, strictly in order.

Please, please include helpful comments with your code!

1. First, write this program without using the instructions beql, movn, movz, and pref. Don't unroll the loop or do software pipelining or other advanced trickery; the simple, straightforward, correct answer is all we're looking for.

2. (counts double) Your solution to Question 1 probably has a branch in the middle of the loop, based on whether the vector element was less than C or not. If the elements of V are distributed randomly around C, that branch is not likely to be very predictable by the sophisticated mechanisms used in modern processors like the R10000. Mispredicted branches can be quite costly. The conditional move instructions movn and movz can help.

Your solution to Question 1 will also suffer from cache misses as the elements of V are loaded from memory. The prefetch instruction pref is designed to help with this problem.

Please write another MIPS IV program to solve the same problem, but use any of the handout's subset instructions you like, and try to avoid the branch-mispredict and cache miss problems noted above. Again, no loop unrolling or traditional software pipelining, but your solution may slightly resemble a software-pipelined loop. (If you're not sure what that is, you don't need to worry about it.) Assume your cache has 8-byte blocks and contains no elements of V when your program starts. Assume that the miss penalty of your cache is about twice the time of a single iteration of your loop. Use the "load" hint in your prefetch instructions. Don't worry about cache delays in the first few iterations, as your loop reaches its steady state.