Grant Wallace
CS 597d

Observation 8:

Code Generation Schema for Modulo Scheduled Loops [1]

Strengths:
    Presents a good motivation for using techniques that eliminate pre-conditioning in modulo scheduled loops, especially in VLIW and deep pipeline processors. Figure 5 shows that only in a very narrow band (of iterations per loop) is the speedup near ideal for these two types of processors. The curve has the look of an exponential decay, leading to large bands where speedup is far from ideal, and the problem gets worse as the modulo number increases (more concurrent iterations in the kernel schedule).
    The modulo number tends to be large for VLIW processors because the initiation interval is short (the processor can start a new iteration quickly thanks to ILP), yet the schedule still needs enough kernel stages for each iteration to complete. The result is that many iterations begin before the first one completes, hence a high modulo number. Deep pipeline processors show a similar effect because of their long instruction latencies: since the latency of an iteration has increased, more iterations can be started before the first finishes.
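
    The arithmetic behind this can be sketched in a few lines. The sketch assumes the modulo number equals the iteration latency divided by the initiation interval (the spacing between iteration starts), rounded up; the function name and the cycle counts are illustrative and are not taken from the paper.

```python
def stage_count(iteration_latency, initiation_interval):
    """Iterations in flight once the pipeline is full (assumed model).

    Both arguments are cycle counts chosen for illustration only.
    """
    if iteration_latency < 1 or initiation_interval < 1:
        raise ValueError("latency and interval must be positive")
    # Integer ceiling division: a new iteration starts every
    # initiation_interval cycles until the first one finishes.
    return -(-iteration_latency // initiation_interval)


if __name__ == "__main__":
    baseline = stage_count(12, 2)       # hypothetical baseline machine
    wider_issue = stage_count(12, 1)    # VLIW-like: shorter interval
    deeper_pipe = stage_count(24, 2)    # deep pipeline: longer latency
    print(baseline, wider_issue, deeper_pipe)
```

    Under these assumed numbers, halving the interval or doubling the latency both double the number of overlapped iterations, which matches the two cases described above.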

    Note that the reduction in speedup arises because the pre-condition section has no inter-iteration ILP and must execute up to the modulo number of these iterations before entering the modulo scheduled loop. The effect on speedup diminishes as the total number of loop iterations increases (i.e., as the iteration count approaches infinity, the speedup approaches ideal no matter what the modulo number). This leads to the question: what is the average number of loop iterations, and how large is the resulting performance hit?
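
    A toy cost model makes the trend concrete. It is a sketch under stated assumptions, not the paper's cost model: one unpipelined iteration is assumed to take (initiation interval x modulo number) cycles, the pre-condition section is assumed to run (iterations mod modulo number) iterations sequentially, and the remaining iterations are charged one initiation interval each plus pipeline fill and drain. All names are hypothetical.

```python
def speedup(n_iters, ii, stages, precond=True):
    """Toy speedup of a modulo scheduled loop over sequential execution.

    n_iters: loop trip count
    ii:      initiation interval in cycles
    stages:  modulo number (iterations overlapped in the kernel)
    precond: whether leftover iterations run in a sequential
             pre-condition section (assumed: n_iters mod stages of them)
    """
    if n_iters < 1 or ii < 1 or stages < 1:
        raise ValueError("all parameters must be positive")
    iter_len = ii * stages              # assumed unpipelined iteration length
    sequential_cycles = n_iters * iter_len

    leftover = n_iters % stages if precond else 0
    pipelined_iters = n_iters - leftover

    cycles = leftover * iter_len        # pre-condition section, no overlap
    if pipelined_iters > 0:
        # One new iteration per ii cycles, plus fill/drain of the pipeline.
        cycles += (pipelined_iters + stages - 1) * ii
    return sequential_cycles / cycles


if __name__ == "__main__":
    for n in (7, 8, 103, 4003):
        print(n, round(speedup(n, 1, 4), 3),
              round(speedup(n, 1, 4, precond=False), 3))
```

    In this model, with a modulo number of 4 and an initiation interval of 1, a 7-iteration loop reaches a speedup of about 1.47 with pre-conditioning versus 2.8 without it, while a 4003-iteration loop reaches about 3.99 against an ideal of 4. The gap closes as the trip count grows, which is why the average trip count matters.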

Weaknesses:
    No benchmark results are shown or discussed. The implementation needed to avoid pre-conditioning could be expensive in terms of hardware (predication, speculation, rotating register files) and in terms of software (compiler complexity, code size), so it would be good to see how much overall speedup these techniques achieve. Note also that modulo scheduling can only be done on inner loops without function calls, and that the performance hit due to pre-conditioning decreases as iterations per loop increase. My intuitive guess is that inner loops without function calls tend to execute many iterations on average, in which case the pre-conditioning penalty would be small.

[1] B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. "Code generation schema for modulo scheduled loops." In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 158-169, December 1992.