Grant Wallace
CS 597d
Observation 8:
Code Generation Schema for Modulo Scheduled Loops [1]
Strengths:
Presents a good motivation for using techniques
that eliminate pre-conditioning in modulo scheduled loops, especially in
VLIW and deep pipeline processors. Figure 5 shows that only in a very narrow
band (of iterations per loop) is the speedup near ideal for these two types
of processors. The curve has the look of an exponential decay, leading
to large bands where speedup is far from ideal, and the problem gets worse
as the modulo number increases (more concurrent iterations in the kernel
schedule).
The modulo number tends to be large for VLIW processors
because the initiation interval is short (the processor can start a new
iteration quickly thanks to ILP), but the schedule must still have enough
kernel stages to complete each iteration. The result is that many iterations
begin before the first one completes, hence a high modulo number. With deep
pipeline processors, long instruction latencies produce a similar effect:
since the latency of an iteration is longer, more iterations
can be started before the first finishes.
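To make the relationship above concrete, here is a minimal sketch. It assumes
(my simplification, not a formula taken from the paper) that the number of
iterations in flight is roughly the iteration latency divided by the initiation
interval, rounded up. The function name and the cycle counts are hypothetical.

```python
def iterations_in_flight(iteration_latency, initiation_interval):
    """Rough count of overlapped iterations (kernel stages).

    Assumption: a new iteration starts every `initiation_interval` cycles and
    each iteration takes `iteration_latency` cycles from start to finish.
    """
    if iteration_latency <= 0 or initiation_interval <= 0:
        raise ValueError("latency and interval must be positive")
    # Ceiling division using integers only.
    return (iteration_latency + initiation_interval - 1) // initiation_interval


# Hypothetical wide VLIW: short initiation interval, moderate latency.
vliw = iterations_in_flight(iteration_latency=20, initiation_interval=2)

# Hypothetical deep pipeline: longer interval, but much longer latency.
deep = iterations_in_flight(iteration_latency=60, initiation_interval=6)
```

With these made-up numbers both machines end up with the same number of
overlapped iterations, one because the interval is short and the other because
the latency is long.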
Note that the speedup is reduced because the
pre-conditioning section has no inter-iteration ILP and must execute up to the
modulo number of these iterations before entering the modulo scheduled
loop. The effect that this has on speedup diminishes as the total number
of loop iterations increases (i.e., as the iteration count goes to infinity,
the speedup approaches the ideal no matter what the modulo number). So this
leads to the question: what is the average number of loop iterations, and how
large is the resulting performance hit?
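The argument above can be sketched with a toy cost model. This is my own
simplification and is not meant to reproduce Figure 5: I assume the remainder
iterations (trip count mod the modulo number) run one at a time at full
iteration latency, and the remaining iterations run in the modulo scheduled
loop, costing one initiation interval each plus a fill/drain term. All names
and numbers are hypothetical.

```python
def preconditioned_speedup(n, iteration_latency, initiation_interval, modulo_number):
    """Toy speedup of a pre-conditioned modulo scheduled loop over sequential code."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Assumed stage count: ceiling of latency / initiation interval.
    stages = (iteration_latency + initiation_interval - 1) // initiation_interval
    remainder = n % modulo_number      # iterations run with no overlap
    pipelined = n - remainder          # iterations run in the scheduled loop
    cycles = remainder * iteration_latency
    if pipelined > 0:
        # One iteration completes per interval, plus pipeline fill/drain.
        cycles += (pipelined + stages - 1) * initiation_interval
    return (n * iteration_latency) / cycles


# Hypothetical machine: latency 20 cycles, interval 2 cycles, modulo number 10.
ideal = 20 / 2
for n in (10, 19, 100, 10_000):
    print(n, round(preconditioned_speedup(n, 20, 2, 10), 2), "ideal", ideal)
```

Under these assumptions a trip count just below a multiple of the modulo number
(19 here) does worse than a smaller trip count that is an exact multiple (10),
and the speedup only approaches the ideal ratio once the trip count is large.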
Weaknesses:
No benchmark results are shown or discussed. The
implementation needed to avoid pre-conditioning could be expensive in terms
of hardware (predication, speculation, rotating register files) and in
terms of software (compiler complexity, code size). So it would be good
to see how much overall speedup these techniques achieve. Note also that
modulo scheduling can only be done on inner loops without function calls,
and that the performance hit due to pre-conditioning decreases as iterations
per loop increase. My intuitive guess is that inner loops without function
calls tend to run for many iterations on average, in which case the
pre-conditioning penalty would be small.
[1] B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. "Code generation
schema for modulo scheduled loops." In Proceedings of the 25th Annual International
Symposium on Microarchitecture, pages 158-169, December 1992.