## Topic 14: Scheduling

## COS 320

# Compiling Techniques 

Princeton University Spring 2016

Lennart Beringer

## The Back End




1. Maps infinite number of virtual registers to finite number of real registers $\rightarrow$ register allocation
2. Removes inefficiencies introduced by front-end $\rightarrow$ optimizer
3. Removes inefficiencies introduced by programmer $\rightarrow$ optimizer
4. Adjusts pseudo-assembly composition and order to match target machine $\rightarrow$ scheduler

## Starting point

$$
\begin{array}{ll}
1 & r 1=r 0+0 \\
2 & r 2=M[F P+A] \\
3 & r 3=r 0+4 \\
4 & r 4=M[F P+X] \\
& \\
& \text { LOOP: } \\
1 & r 5=r 3 * r 1 \\
2 & \\
3 & r 5=r 2+r 5 \\
4 & M[r 5]=r 4 \\
5 & r 1=r 1+1 \\
6 & \mathrm{BR}\ r 1<=10,\ \mathrm{LOOP}
\end{array}
$$

Multiplication takes 2 cycles.

## Motivating example

Instructions take multiple cycles: fill empty slots with independent instructions!

| Cycle | Instruction |
| :--- | :--- |
| 1 | $r1 = r0 + 0$ |
| 2 | $r2 = M[FP + A]$ |
| 3 | $r3 = r0 + 4$ |
| 4 | $r4 = M[FP + X]$ |

LOOP:

| Cycle | Original loop | Loop with the empty slot filled |
| :--- | :--- | :--- |
| 1 | $r5 = r3 * r1$ | $r5 = r3 * r1$ |
| 2 |  | $r1 = r1 + 1$ |
| 3 | $r5 = r2 + r5$ | $r5 = r2 + r5$ |
| 4 | $M[r5] = r4$ | $M[r5] = r4$ |
| 5 | $r1 = r1 + 1$ | BR r1 <= 10, LOOP |
| 6 | BR r1 <= 10, LOOP |  |


## Motivating example

## When our processor can execute 2 instructions per cycle

```
r1 = r0 + 0
r2 = M[FP + A]
r3 = r0 + 4
r4 = M[FP + X]
LOOP:
r5 = r3 * r1
r1 = r1 + 1
r5 = r2 + r5
M[r5] = r4
BR r1 <= 10, LOOP
```


Issue pairs of independent instructions whenever possible (the number is the issue cycle):

```
1   r1 = r0 + 0        r2 = M[FP + A]
2   r3 = r0 + 4        r4 = M[FP + X]
LOOP:
1   r5 = r3 * r1       r1 = r1 + 1
2
3   r5 = r2 + r5
4   M[r5] = r4         BR r1 <= 10, LOOP
```


Is this the same notion of "independent"?


## Instruction Level Parallelism

- Instruction-Level Parallelism (ILP): the concurrent execution of independent assembly instructions. The concurrently executed instructions stem from a single program.
- ILP is a cost-effective way to extract performance from programs.
- Exploiting ILP requires global optimization and scheduling.
- Processors can execute several instructions per cycle (Itanium: up to 6)
- ILP/VLIW: dependencies identified by compiler $\rightarrow$ instruction bundles
- Super-Scalar: dependencies identified by processor (instruction windows)

Advantages / Disadvantages?




## Possible synthesis:

- have compiler take care of register-carried dependencies
- let processor take care of memory-carried dependencies: exploit dynamic resolution of memory aliasing
- use register renaming, register bypassing, out-of-order execution, speculation (branch prediction) to keep all execution units busy


## Scheduling constraints

- Data dependencies
  - ordering between instructions that arises from the flow of data
- Control dependencies
  - ordering between instructions that arises from flow of control
- Resource constraints
  - processors have limited number of functional units
  - not all functional units can execute all instructions (Floating point unit versus Integer-ALU, ...)
  - only limited number of instructions can be issued in one cycle
  - only a limited number of register read/writes can be done concurrently


## Data Dependences

- A data dependence is a constraint on scheduling arising from the flow of data between two instructions. Types:
  - RAW (Read After Write): An instruction $u$ is flow-dependent on a preceding instruction $d$ if $u$ consumes a value computed by $d$. This is a "true" dependence: it arises from the actual flow of values.
  - WAR (Write After Read): An instruction $d$ is anti-dependent on a preceding instruction $u$ if $d$ writes to a location read by $u$.
  - WAW (Write After Write): An instruction $d_{2}$ is output-dependent on a preceding instruction $d_{1}$ if $d_{1}$ writes to a location also written by $d_{2}$.
- Types of data:
  - Register dependence
  - Memory dependence
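The three dependence types can be illustrated with a small sketch that compares the read and write sets of two instructions. The `Instr` record and the `classify` helper are hypothetical names introduced here for illustration; the sketch only looks at one pair of instructions and ignores anything in between.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """An instruction summarized by the locations it writes and reads."""
    text: str
    writes: frozenset
    reads: frozenset

def classify(first: Instr, second: Instr) -> set:
    """Dependence types of `second` on the preceding instruction `first`."""
    deps = set()
    if first.writes & second.reads:
        deps.add("RAW")  # second consumes a value computed by first
    if first.reads & second.writes:
        deps.add("WAR")  # second overwrites a location that first reads
    if first.writes & second.writes:
        deps.add("WAW")  # both write the same location
    return deps

# Instructions taken from the loop of the motivating example.
mul = Instr("r5 = r3 * r1", frozenset({"r5"}), frozenset({"r3", "r1"}))
add = Instr("r5 = r2 + r5", frozenset({"r5"}), frozenset({"r2", "r5"}))
inc = Instr("r1 = r1 + 1", frozenset({"r1"}), frozenset({"r1"}))

print(sorted(classify(mul, add)))  # ['RAW', 'WAW']
print(sorted(classify(mul, inc)))  # ['WAR']
```

The WAR result for the second pair is why `r1 = r1 + 1` may move up next to the multiplication but not above it.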


## Eliminating false dependencies

WAW and WAR dependencies can often be eliminated by register renaming...

... at the cost of adding registers...
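A minimal sketch of renaming on straight-line code, assuming instructions are given as `(dest, [sources])` pairs and that fresh registers `t0, t1, ...` are available without limit. The `rename` function and the `t`-register naming are illustrative assumptions, not a specific compiler's implementation.

```python
from itertools import count

def rename(block, live_in):
    """Give every definition a fresh register; uses follow the latest name."""
    fresh = (f"t{n}" for n in count())
    current = {r: r for r in live_in}  # original name -> current name
    out = []
    for dest, srcs in block:
        new_srcs = [current.get(s, s) for s in srcs]  # read latest versions first
        current[dest] = next(fresh)                   # then rebind the destination
        out.append((current[dest], new_srcs))
    return out, current

block = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # r4 = op r1   (RAW on r1)
    ("r1", ["r5"]),        # r1 = op r5   (WAR with line 2, WAW with line 1)
]
renamed, final = rename(block, live_in=["r2", "r3", "r5"])
for dest, srcs in renamed:
    print(dest, "<-", srcs)
```

After renaming, only the RAW dependence between the first two instructions remains, so the third instruction can be scheduled anywhere. The returned `final` map records which fresh register holds each original name at the end of the block, which is needed for values that are live afterwards.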

## Eliminating false dependences

WAR dependencies can often be replaced by RAW dependencies...

... at the price of using yet another register, and a (move) instruction ....

## Control Dependence

Node y is control dependent on x if

- x is a branch, with successors $u$, $v$
- y post-dominates u in the CFG: each path from u to EXIT includes y
- $y$ does not post-dominate $v$ in the CFG: there is a path from $v$ to EXIT that avoids y

Schedule must respect control dependences: don't move instructions past their control dependence ancestors!
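The definition can be checked mechanically once post-dominator sets are known. The sketch below is an illustration under assumptions: the CFG is a dictionary from node to successor list, every node can reach `EXIT`, and the function names are made up for this example. It uses the simple iterative set-based algorithm rather than anything tuned for speed.

```python
def post_dominators(cfg, exit_node):
    """pdom[n] = nodes that lie on every path from n to exit_node."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n post-dominates itself; add what all successors agree on
            new = {n} | set.intersection(*(pdom[s] for s in cfg[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def control_dependent(cfg, pdom, y, x):
    """True if x is a branch with successors u, v such that
    y post-dominates u but does not post-dominate v."""
    succs = cfg[x]
    return len(succs) > 1 and any(
        y in pdom[u] and y not in pdom[v]
        for u in succs for v in succs if u != v)

# x branches to u or v; y is only reached through u; m is the join point.
cfg = {"x": ["u", "v"], "u": ["y"], "y": ["m"],
       "v": ["m"], "m": ["EXIT"], "EXIT": []}
pdom = post_dominators(cfg, "EXIT")
print(control_dependent(cfg, pdom, "y", "x"))  # True
print(control_dependent(cfg, pdom, "m", "x"))  # False
```

Node `m` executes regardless of the branch outcome, so it is not control dependent on `x` and may be moved relative to the branch as far as control dependence is concerned.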



## Latency

- Amount of time after an instruction starts executing until its result is ready.
- An instruction can have more than one latency! E.g. a load, depending on cache hit/miss.


## Data Dependence Graph

- A data dependence graph consists of instructions and a set of directed data dependence edges among them in which each edge is labeled with its latency and type of dependence.
- Scheduling (code motion) must respect dependence graph.


Program dependence graph: overlay of data dependence graph with control dependencies (two kinds of edges)

## Hardware Scheduling

Machines can also do scheduling...

- hardware schedulers process code after it has been fetched
- hardware finds independent instructions
- works with legacy architectures (found in x86 / Pentium)
- program knowledge more precise at run-time
  - memory dependence
  - control flow resolved

But compiler still important.

- Hardware schedulers have a small window.
- Hardware complexity increases.
- Hardware does not benefit directly from compiler optimization.


Modern processors:

- many more stages (up to 20-30)
- different stages take different number of cycles per instruction
- some (components of) stages duplicated, e.g. super-scalar

Common characteristics: resource constraints

- each stage can only hold a fixed number of instructions per cycle
- but: instructions can be in-flight concurrently (pipeline - more later)
- register bank can only serve small number of reads/writes per cycle


## Goal of scheduling

Construct a sorted version of the dependence graph that

- produces the same result as the sequential program: respect dependencies, latencies
- obeys the resource constraints
- minimizes execution time (other metrics possible)



Solution formulated as a table that indicates the issue cycle of each instruction:

| Cycle | Resource 1 | Resource 2 | $\ldots$ | Resource n |
| :---: | :---: | :---: | :---: | :---: |
| 1 | 1 |  |  | 2 |
| 2 |  | 3 |  | 4 |
| 3 |  |  |  |  |
| $\vdots$ |  |  |  |  |

Even simplified versions of the scheduling problem are typically NP-hard $\rightarrow$ heuristics


## A classification of scheduling heuristics

## Schedule within a basic block (local)

- instructions cannot move past basic block boundaries
- schedule covers only one basic block

Example technique: (priority) list scheduling

## Scheduling across basic blocks (global)

- instructions move past basic block boundaries
- schedule typically covers a (frequently executed) trace

Example technique: trace scheduling

## Loop scheduling

- instructions cannot move past basic block boundaries
- each schedule covers body of a loop
- exploits/reflects pipeline structure of modern processors

Example technique: SW pipelining, modulo scheduling



## Local scheduling: list scheduling

Advantage: can disregard control dependencies

Input:

- data dependence graph of straight-line code, annotated with (conservative) latencies
- instruction forms annotated with suitable type of Functional Units
- number of available Functional Units of each type

| Integer-ALU | FP | MEM |
| :---: | :---: | :---: |
| 2 | 1 | 1 |

Output: cycle-accurate assignment of instructions to functional units

| Cycle | ALU1 | ALU2 | FP | MEM |
| :---: | :---: | :---: | :---: | :---: |
| 1 | 1 |  |  | 2 |
| 2 |  |  |  |  |
| 3 |  |  |  |  |
| 4 |  | 3 |  | 4 |
| 5 |  |  |  |  |
| 6 |  |  | 5 | 6 |

Can be refined for pipelined architectures, where latency != reservation period for FU


## List scheduling: algorithm

1. Insert nodes that have no predecessors into queue
2. Start with cycle count $c = 1$
3. While queue not empty:
   - select an instruction i from the queue such that all predecessors were scheduled "sufficiently long ago" (latency information); priority: e.g. length of path to EXIT, maybe weighted by latency of RAW (+WAW/WAR?) deps
   - if a functional unit u for i is available:
     - insert i in (c, u), and remove it from the queue
     - insert any successor of i into queue for which all predecessors have now been scheduled
   - if no functional unit is available for i, select another instruction
   - if no instruction from the queue was scheduled, increment c

Variation:

- start at nodes without successors and cycle count LAST
- work upwards, entering finish times of instructions in table
- availability of FU's still governed by start times
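The forward variant of the algorithm can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: each functional unit is occupied for one cycle only (fully pipelined), every unit type has at least one unit, the priority is the latency-weighted path length to the end of the block, and the sketch fills each cycle greedily before moving to the next. The names `list_schedule`, `instrs`, `edges` and `units` are introduced here.

```python
def list_schedule(instrs, edges, units):
    """instrs: {name: fu_type}; edges: [(pred, succ, latency)];
    units: {fu_type: count}. Returns {name: (cycle, fu_type, unit_index)}."""
    preds = {i: [] for i in instrs}
    succs = {i: [] for i in instrs}
    for p, s, lat in edges:
        preds[s].append((p, lat))
        succs[p].append((s, lat))

    prio = {}
    def path_len(i):  # latency-weighted longest path to the end of the block
        if i not in prio:
            prio[i] = max((lat + path_len(s) for s, lat in succs[i]), default=0)
        return prio[i]

    queue = [i for i in instrs if not preds[i]]
    schedule, cycle = {}, 1
    while queue:
        busy = {t: 0 for t in units}  # units of each type used in this cycle
        issued = []
        for i in sorted(queue, key=path_len, reverse=True):
            ready = all(schedule[p][0] + lat <= cycle for p, lat in preds[i])
            fu = instrs[i]
            if ready and busy[fu] < units[fu]:
                schedule[i] = (cycle, fu, busy[fu])
                busy[fu] += 1
                issued.append(i)
        for i in issued:
            queue.remove(i)
            for s, _ in succs[i]:
                if s not in queue and all(p in schedule for p, _ in preds[s]):
                    queue.append(s)
        cycle += 1
    return schedule

# Loop body of the motivating example on a single-issue machine.
instrs = {n: "slot" for n in ["mul", "inc", "add", "store", "br"]}
edges = [("mul", "add", 2),    # RAW on r5, multiplication takes 2 cycles
         ("add", "store", 1),  # RAW on r5
         ("mul", "inc", 0),    # WAR on r1
         ("inc", "br", 1),     # RAW on r1
         ("store", "br", 1)]   # keep the branch last in the block
sched = list_schedule(instrs, edges, {"slot": 1})
for name, (c, fu, u) in sorted(sched.items(), key=lambda kv: kv[1][0]):
    print(c, name)
```

On this input the sketch reproduces the filled schedule from the motivating example: the increment of `r1` lands in the cycle that would otherwise be empty while the multiplication completes.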


## Trace scheduling

Observation: individual basic blocks often don't have much ILP

- speed-up limited
- many slots in list schedule remain empty: poor resource utilization
- problem is accentuated by deep pipelines, where many instructions could be concurrently in-flight

Q: How can we extend scheduling to many basic blocks?

A: By considering sets of basic blocks that are often executed together

- select instructions along frequently executed traces (a trace is an acyclic path through the CFG), e.g. by profiling, counting the traversals of each CFG edge
- schedule trace members using list scheduling
- adjust off-trace code to deal with executions that only traverse parts of the trace

A trace t, and its neighbors.

1. Construct the data dependence graph of the instructions on the trace, but consider the live-ins of $A$ to be read by $b$.

## Trace scheduling: compensation code S




In step 2, some instructions in B1 end up above s in B, others below.
Copy the latter ones into the edge $s \rightarrow A$, into a new block $S$, so that they're executed when control flow follows $B1 \rightarrow s \rightarrow A$, but not when A is entered through a different edge.


## Trace scheduling: adjust code jumping to j

In step 2, some instructions in B2 end up above j in B, others below.
Adjust the jump in C to point to the first instruction (bundle) following the last instruction in B that stems from B2 - call the new jump target j'. Thus yellow instructions remain non-executed if control enters B from C: all instructions from B2 are above j'.

Note: if there's no yellow instruction, we're in fact adjusting j upwards: j' follows the last purple instruction.

## Trace scheduling: adjusting code jumping into B

Next, some instructions from B3/B4 end up above j' in B, others below. Copy the former ones into the edge $C \rightarrow j'$, into a new block J, ensuring that instructions following j' receive correct data when flow enters B via C.

## Trace scheduling: cleaning up S and J

Final cleanup: some instructions in S and J may be dead - eliminate them. Then, S and J can be (list-)scheduled or be part of the next trace.

## Pipelining

Purely sequential execution:

| FETCH | DECODE | EXECUTE | MEM | WRITE | FETCH | DECODE | EXECUTE | MEM |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

Pipelining - can partially overlap instructions:

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Instruction 1 | FETCH | DECODE | EXECUTE | MEM | WRITE |  |  |
| Instruction 2 |  | FETCH | DECODE | EXECUTE | MEM | WRITE |  |
| Instruction 3 |  |  | FETCH | DECODE | EXECUTE | MEM | WRITE |
One instruction issued (and retired) each cycle - speedup $\approx$ pipeline depth

... assuming that

- each instruction spends one cycle in each stage
- all instruction forms visit same (sequence of) FU's
- there are no (data) dependencies

## Pipelining for realistic processors

Different instructions visit different sets/sequences of functional units, and occasionally multiple types of functional units in the same cycle:

Example: floating point instructions on MIPS R4000 (ADD, MUL, CONV)

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ADD | FETCH | READ | UNPACK | SHIFT, ADD | ROUND, ADD | ROUND, SHIFT | WRITE |  |  |  |
| MUL | FETCH | READ | UNPACK | MULTA | MULTA | MULTA | MULTB | MULTB, ADD | ROUND | WRITE |
| CONV | FETCH | READ | UNPACK | ADD | ROUND | SHIFT | SHIFT | ADD | ROUND | WRITE |

Contention for FU's means some pipelinings must be avoided:

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MUL | FETCH | READ | UNPACK | MULTA | MULTA | MULTA | MULTB | MULTB, ADD | ROUND | WRITE |
| ADD |  |  |  | FETCH | READ | UNPACK | SHIFT, ADD | ROUND, ADD | ROUND, SHIFT | WRITE |
## Pipelining constraints: data dependencies

RAW dependency:

Register bypassing / operand forwarding: extra HW to communicate data directly between FU's. Result of one stage is available at another stage in the next cycle.

## Loop scheduling without resource bounds

- illustrates use of loop unrolling and introduces terminology for full SW pipelining
- but not useful in practice

$$
\begin{aligned}
& \text{for } i \leftarrow 1 \text{ to } N \\
& \quad a \leftarrow j \diamond V[i-1] \\
& \quad b \leftarrow a \diamond f \\
& \quad c \leftarrow e \diamond j \\
& \quad d \leftarrow f \diamond c \\
& \quad e \leftarrow b \diamond d \\
& \quad f \leftarrow U[i] \\
& \quad g: V[i] \leftarrow b \\
& \quad h: W[i] \leftarrow d \\
& \quad j \leftarrow X[i]
\end{aligned}
$$

($\diamond$: some binary op(s))

Scalar replacement, and making the "iteration index" explicit:

$$
\begin{aligned}
& \text{for } i \leftarrow 1 \text{ to } N \\
& \quad a_{i} \leftarrow j_{i-1} \diamond b_{i-1} \\
& \quad b_{i} \leftarrow a_{i} \diamond f_{i-1} \\
& \quad c_{i} \leftarrow e_{i-1} \diamond j_{i-1} \\
& \quad d_{i} \leftarrow f_{i-1} \diamond c_{i} \\
& \quad e_{i} \leftarrow b_{i} \diamond d_{i} \\
& \quad f_{i} \leftarrow U[i] \\
& \quad g: V[i] \leftarrow b_{i} \\
& \quad h: W[i] \leftarrow d_{i} \\
& \quad j_{i} \leftarrow X[i]
\end{aligned}
$$

## Data dependence graph of body

The graph has two kinds of edges: same-iteration dependences and cross-iteration dependences.

The data dependence graph of the unrolled body is acyclic!

## Arrange in tableau

- rows: cycles
- columns: iterations

Place each instruction in the earliest cycle allowed by its dependences, first for iteration 1, then for some more iterations:

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | a c f j | f j | f j | f j | f j | f j |
| 2 | b d |  |  |  |  |  |
| 3 | e g h | a |  |  |  |  |
| 4 |  | b c |  |  |  |  |
| 5 |  | d g | a |  |  |  |
| 6 |  | e h | b |  |  |  |
| 7 |  |  | c g | a |  |  |
| 8 |  |  | d | b |  |  |
| 9 |  |  | e h | g | a |  |
| 10 |  |  |  | c | b |  |
| 11 |  |  |  | d | g | a |
| 12 |  |  |  | e h |  | b |
| 13 |  |  |  |  | c | g |
| 14 |  |  |  |  | d |  |
| 15 |  |  |  |  | e h |  |
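The tableau can be reproduced by an as-soon-as-possible placement over the unrolled loop. The sketch below assumes that every instruction takes one cycle, that there are no resource bounds, and that values from before the loop (iteration 0) are available in cycle 1. The `BODY` encoding and the `tableau` function are names introduced here; each operand is paired with its iteration distance (0 for same-iteration, 1 for cross-iteration).

```python
# Scalar-replaced loop body: (instruction, [(operand, iteration distance)]).
BODY = [
    ("a", [("j", 1), ("b", 1)]),
    ("b", [("a", 0), ("f", 1)]),
    ("c", [("e", 1), ("j", 1)]),
    ("d", [("f", 1), ("c", 0)]),
    ("e", [("b", 0), ("d", 0)]),
    ("f", []),
    ("g", [("b", 0)]),
    ("h", [("d", 0)]),
    ("j", []),
]

def tableau(iterations):
    """Earliest cycle of every instruction instance (name, iteration)."""
    cycle = {}
    for i in range(1, iterations + 1):
        for name, operands in BODY:
            # an operand produced in cycle t can be consumed in cycle t + 1
            ready = [cycle[(op, i - dist)] + 1
                     for op, dist in operands if i - dist >= 1]
            cycle[(name, i)] = max(ready, default=1)
    return cycle

t = tableau(6)
for c in range(1, 16):
    row = ["".join(n for n, _ in BODY if t[(n, i)] == c) for i in range(1, 7)]
    print(c, row)
```

The printed rows match the tableau: for example, `e` of iterations 1 to 4 lands in cycles 3, 6, 9 and 12, while `f` and `j` of every iteration are placed in cycle 1 because they depend on nothing.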

Identify groups of instructions; note gaps.

Close gaps by delaying fast instruction groups.

Identify "steady state" - of slope 3. The rows before the steady state form the prologue.

| Cycle | 1 | 2 | 3 | 4 | 5 | 6 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | a c f j |  |  |  |  |  |
| 2 | b d | f j |  |  |  |  |
| 3 | e g h | a |  |  |  |  |
| 4 |  | b c | f j |  |  |  |
| 5 |  | d g | a |  |  |  |
| 6 |  | e h | b | f j |  |  |
| 7 |  |  | c g | a |  |  |
| 8 |  |  | d | b |  |  |
| 9 |  |  | e h | g | f j |  |
| 10 |  |  |  | c | a |  |
| 11 |  |  |  | d | b |  |
| 12 |  |  |  | e h | g | f j |
| 13 |  |  |  |  | c | a |
| 14 |  |  |  |  | d | b |
| 15 |  |  |  |  | e h | g |

## Loop scheduling without resource bounds

## Expand instructions

- No cycle has >5 instructions
- Instructions in a row execute in parallel; reads in RHS happen before writes in LHS

1. Prologue - also set up i

| $\mathrm{a}_{1} \leftarrow \mathrm{j}_{0} \diamond \mathrm{~b}_{0}$ | $c_{1} \leftarrow e_{0} \diamond j_{0}$ | $\mathrm{f}_{1} \leftarrow \mathrm{U}[1]$ | $\mathrm{j}_{1} \leftarrow \mathrm{X}[1]$ |  |
| :---: | :---: | :---: | :---: | :---: |
| $\mathrm{b}_{1} \leftarrow \mathrm{a}_{1} \diamond \mathrm{f}_{0}$ | $\mathrm{d}_{1} \leftarrow \mathrm{f}_{0} \diamond \mathrm{c}_{1}$ | $\mathrm{f}_{2} \leftarrow \mathrm{U}[2]$ | $\mathrm{j}_{2} \leftarrow \mathrm{X}[2]$ |  |
| $\mathrm{e}_{1} \leftarrow \mathrm{~b}_{1} \diamond \mathrm{~d}_{1}$ | $\mathbf{V}[1] \leftarrow \mathrm{b}_{1}$ | $\mathrm{W}[1] \leftarrow \mathrm{d}_{1}$ | $\mathrm{a}_{2} \leftarrow \mathrm{j}_{1} \diamond \mathrm{~b}_{1}$ |  |
| $\mathrm{b}_{2} \leftarrow \mathrm{a}_{2} \diamond \mathrm{f}_{1}$ | $c_{2} \leftarrow \mathrm{e}_{1} \diamond \mathrm{j}_{1}$ | $\mathrm{f}_{3} \leftarrow \mathrm{U}[3]$ | $\mathrm{j}_{3} \leftarrow \mathrm{X}[3]$ |  |
| $\mathrm{d}_{2} \leftarrow \mathrm{f}_{1} \diamond \mathrm{c}_{2}$ | $\mathrm{V}[2] \leqslant \mathrm{b}_{2}$ | $\mathrm{a}_{3} \leftarrow \mathrm{j}_{2} \diamond \mathrm{~b}_{2}$ |  |  |
| $\mathrm{e}_{2} \leftarrow \mathrm{~b}_{2} \diamond \mathrm{~d}_{2}$ | $W[2] \leftarrow d_{2}$ | $\mathrm{b}_{3} \leftarrow \mathrm{a}_{3} \diamond \mathrm{f}_{2}$ | $\mathrm{f}_{4} \leftarrow \mathrm{U}[4]$ | $\mathrm{j}_{4} \leftarrow \mathrm{X}[4]$ |
| $c_{3} \leftarrow e_{2} \diamond j_{2}$ | $\mathrm{V}[3] \leftarrow \mathrm{b}_{3}$ | $\mathrm{a}_{4} \leftarrow j_{3} \diamond \mathrm{~b}_{3}$ |  | $i \leftarrow 3$ |


| for $i \leftarrow 1$ to $N$ |
| :--- |
| $a_{i} \leftarrow j_{i-1} \diamond b_{i-1}$ |
| $b_{i} \leftarrow a_{i} \diamond f_{i-1}$ |
| $c_{i} \leftarrow e_{i-1} \diamond j_{i-1}$ |
| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ |
| $f_{i} \leftarrow U[i]$ |
| $g: V[i] \leftarrow b_{i}$ |
| $h: W[i] \leftarrow d_{i}$ |
| $j_{i} \leftarrow X[i]$ |

## Loop scheduling without resource bounds

## Expand instructions

- no cycle has $>5$ instructions
- Instructions in a row execute in parallel; reads in RHS happen before writes in LHS

2. Loop body - also increment counter and insert (modified) exit condition

Note: the MCIML book prints an incorrect index on $a_{i}$ in this loop body.

| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j_{i}$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |

As expected, the loop body has one copy of each instruction a-j, plus induction variable update + test

## Loop scheduling without resource bounds

## Expand instructions

- no cycle has >5 instructions
- Instructions in a row execute in parallel; reads in RHS happen before writes in LHS

| 8 | d | b |  |  |
| :---: | :---: | :---: | :---: | :---: |
| 9 | eh | g | f j |  |
| 10 |  | c | a |  |
| 11 |  | d | b |  |
| 12 |  | eh | g | f j |
| 13 |  |  | c | a |

3. Loop epilogue - finish all N iterations

| $d_{N-1} \leftarrow f_{N-2} \diamond c_{N-1}$ | $b_{N} \leftarrow a_{N} \diamond f_{N-1}$ |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
| $e_{N-1} \leftarrow b_{N-1} \diamond d_{N-1}$ | $W[N-1] \leftarrow d_{N-1}$ | $V[N] \leftarrow b_{N}$ |  |  |
| $c_{N} \leftarrow e_{N-1} \diamond j_{N-1}$ |  |  |  |  |
| $d_{N} \leftarrow f_{N-1} \diamond c_{N}$ |  |  |  |  |
| $e_{N} \leftarrow b_{N} \diamond d_{N}$ | $W[N] \leftarrow d_{N}$ |  |  |  |


| for $i \leftarrow 1$ to $N$ |
| :--- |
| $a_{i} \leftarrow j_{i-1} \diamond b_{i-1}$ |
| $b_{i} \leftarrow a_{i} \diamond f_{i-1}$ |
| $c_{i} \leftarrow e_{i-1} \diamond j_{i-1}$ |
| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ |
| $f_{i} \leftarrow U[i]$ |
| $g: V[i] \leftarrow b_{i}$ |
| $h: W[i] \leftarrow d_{i}$ |
| $j_{i} \leftarrow X[i]$ |

## Loop scheduling without resource bounds

| $a_{1} \leftarrow j_{0} \diamond b_{0}$ | $c_{1} \leftarrow e_{0} \diamond j_{0}$ | $f_{1} \leftarrow U[1]$ | $j_{1} \leftarrow X[1]$ |  |
| :---: | :---: | :---: | :---: | :---: |
| $b_{1} \leftarrow a_{1} \diamond f_{0}$ | $d_{1} \leftarrow f_{0} \diamond c_{1}$ | $f_{2} \leftarrow U[2]$ | $j_{2} \leftarrow X[2]$ |  |
| $e_{1} \leftarrow b_{1} \diamond d_{1}$ | $V[1] \leftarrow b_{1}$ | $W[1] \leftarrow d_{1}$ | $a_{2} \leftarrow j_{1} \diamond b_{1}$ |  |
| $b_{2} \leftarrow a_{2} \diamond f_{1}$ | $c_{2} \leftarrow e_{1} \diamond j_{1}$ | $f_{3} \leftarrow U[3]$ | $j_{3} \leftarrow X[3]$ |  |
| $d_{2} \leftarrow f_{1} \diamond c_{2}$ | $V[2] \leftarrow b_{2}$ | $a_{3} \leftarrow j_{2} \diamond b_{2}$ |  |  |
| $e_{2} \leftarrow b_{2} \diamond d_{2}$ | $W[2] \leftarrow d_{2}$ | $b_{3} \leftarrow a_{3} \diamond f_{2}$ | $f_{4} \leftarrow U[4]$ | $j_{4} \leftarrow X[4]$ |
| $c_{3} \leftarrow e_{2} \diamond j_{2}$ | $V[3] \leftarrow b_{3}$ | $a_{4} \leftarrow j_{3} \diamond b_{3}$ |  | $i \leftarrow 3$ |


| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j_{i}$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |


| $d_{N-1} \leftarrow f_{N-2} \diamond c_{N-1}$ | $b_{N} \leftarrow a_{N} \diamond f_{N-1}$ |  |  |  |
| :--- | :--- | :--- | :--- | :--- |
| $e_{N-1} \leftarrow b_{N-1} \diamond d_{N-1}$ | $W[N-1] \leftarrow d_{N-1}$ | $V[N] \leftarrow b_{N}$ |  |  |
| $c_{N} \leftarrow e_{N-1} \diamond j_{N-1}$ |  |  |  |  |
| $d_{N} \leftarrow f_{N-1} \diamond c_{N}$ |  |  |  |  |
| $e_{N} \leftarrow b_{N} \diamond d_{N}$ | $W[N] \leftarrow d_{N}$ |  |  |  |

## Loop scheduling without resource bounds

Final step: eliminate indices i from variables - want
"constant" variables/registers in body!

| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j_{i}$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |

need 3 copies of j since up to 3 copies are live: $j_{i+2} \rightarrow j$, $j_{i+1} \rightarrow j'$, $j_{i} \rightarrow j''$

| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j''$ | $a_{i+2} \leftarrow j' \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |

## Loop scheduling without resource bounds

Final step: eliminate indices i from variables - want "constant" variables/registers in body!

| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j_{i}$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |

need 3 copies of j since up to 3 copies are live: $j_{i+2} \rightarrow j$, $j_{i+1} \rightarrow j'$, $j_{i} \rightarrow j''$

| $d_{i} \leftarrow f_{i-1} \diamond c_{i}$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_{i}$ | $j'' \leftarrow j'$ | $j' \leftarrow j$ |  |
| :---: | :---: | :---: | :---: | :---: |
| $e_{i} \leftarrow b_{i} \diamond d_{i}$ | $W[i] \leftarrow d_{i}$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j \leftarrow X[i+2]$ |
| $c_{i+1} \leftarrow e_{i} \diamond j''$ | $a_{i+2} \leftarrow j' \diamond b_{i+1}$ | $i \leftarrow i+1$ | if $i<N-2$ goto $L$ |  |

- the copies live across an iteration need to be updated in each iteration.
- also, need to initialize the live-in copies of the loop at the end of prologue ($j$, $j'$)
- also, can replace the indexed live-in copies of the epilogue with primed versions
- all this for all variables a, ..., j (see book - modulo typo regarding a, a')
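The rotation of the copies of j can be checked with a small simulation. The following is a minimal sketch, not the book's code: the function name `rotated_reads` and the use of a Python list for the array X are assumptions made for illustration. It models only the j-stream of the loop body and records which values the two readers (the instructions computing $c_{i+1}$ and $a_{i+2}$) see.

```python
def rotated_reads(X, first_i, last_i):
    """Simulate the copies j, j', j'' of the loop body for i = first_i..last_i.

    Returns, per iteration, the pair (value read via j'', value read via j').
    """
    # End of prologue: initialize the live-in copies j = j_{i+1}, j' = j_i.
    j, j1 = X[first_i + 1], X[first_i]
    reads = []
    for i in range(first_i, last_i + 1):
        # Row 1: j'' <- j' and j' <- j execute in parallel
        # (reads happen before writes, like a tuple assignment).
        j2, j1 = j1, j
        # Row 2: j <- X[i+2]
        j = X[i + 2]
        # Row 3: c_{i+1} reads j'' (= j_i), a_{i+2} reads j' (= j_{i+1}).
        reads.append((j2, j1))
    return reads


if __name__ == "__main__":
    X = [10 * k for k in range(10)]
    for i, (via_j2, via_j1) in zip(range(3, 7), rotated_reads(X, 3, 6)):
        print(i, via_j2, via_j1)
```

In every iteration the pair matches $(X[i], X[i+1])$, i.e. the values the indexed names $j_i$ and $j_{i+1}$ would have held.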


## Loop scheduling without resource bounds

## Summary of main steps

1. calculate data dependence graph of unrolled loop
2. schedule each instruction from each loop as early as possible
3. plot the tableau of iterations versus cycles
4. identify groups of instructions, and their slopes
5. coalesce the slopes by slowing down fast instruction groups
6. identify steady state, and loop prologue and epilogue
7. reroll the loop, removing the iteration-indexed variable names
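Steps 1 to 4 can be sketched for the running example as follows. This is an illustrative sketch under stated assumptions: every instruction has latency 1, the iteration-0 values are available before cycle 1, and the names `BODY` and `asap` are made up for this example.

```python
# Loop body from the slides. Each instruction lists the values it reads as
# (producer, iteration offset); offset -1 means "from the previous iteration".
BODY = [
    ("a", [("j", -1), ("b", -1)]),
    ("b", [("a", 0), ("f", -1)]),
    ("c", [("e", -1), ("j", -1)]),
    ("d", [("f", -1), ("c", 0)]),
    ("e", [("b", 0), ("d", 0)]),
    ("f", []),             # f_i <- U[i]
    ("g", [("b", 0)]),     # V[i] <- b_i
    ("h", [("d", 0)]),     # W[i] <- d_i
    ("j", []),             # j_i <- X[i]
]


def asap(n_iterations, latency=1):
    """Schedule every instruction of the unrolled loop as early as possible.

    Returns a dict mapping (instruction, iteration) to its issue cycle.
    There are no resource bounds: any number of instructions per cycle.
    """
    cycle = {}
    for i in range(1, n_iterations + 1):
        for name, reads in BODY:
            # Values of iteration 0 are not in the dict: treat them as cycle 0.
            ready = [cycle.get((src, i + off), 0) for src, off in reads]
            cycle[(name, i)] = max(ready, default=0) + latency
    return cycle


if __name__ == "__main__":
    t = asap(4)
    # Slope of a group = cycles between consecutive iterations.
    for name in "acf":
        print(name, [t[(name, i)] for i in range(1, 5)])
```

The printed rows show the slopes that step 4 looks for: the group {a, b} advances by 2 cycles per iteration, {c, d, e} by 3, and the loads f and j by 0, so step 5 has to slow the faster groups down to the slope of {c, d, e}.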

## Loop scheduling with resource bounds

## Input:

- data dependences of loop, with latency annotations
- resource requirements of all instruction forms:

| ADD | FETCH | READ | UNPACK | SHIFT | ROUND | ROUND | WRITE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  | ADD | SHIFT |  |  |

| MUL | FETCH | READ | UNPACK | MULTA | MULTA | MULTA | MULTB | MULTB | ROUND | WRITE |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  |  |  |  |  | ADD |  |  |

- number of available functional units (FUs) of each type, and descriptions of the FU types:
  - number of instructions that can be issued in one cycle,
  - restrictions on which instruction forms can be issued simultaneously, etc.

Modulo scheduling:

- find schedule that satisfies resource and (data) dependency requirements; then do register allocation
- try to schedule loop body using $\Delta$ cycles, for $\Delta=\Delta_{\min }, \Delta_{\min }+1, \Delta_{\min }+2 \ldots$
- body surrounded by prologue and epilogue as before


\section*{Modulo scheduling: where's the mod?}

Observation: if resource constraints prevent an instruction from being scheduled at time \(t\), they also prevent it from being scheduled at times \(t+\Delta, t+2\Delta, \ldots\), or indeed at any \(t'\) with \(t' \equiv t \pmod{\Delta}\).

Example: \(\Delta=3\), the machine can execute only 1 load instruction at a time, loop body from previous example.
\begin{tabular}{|l|l|l|}
\hline 0 & & \\
\hline 1 & \(f_{i} \leftarrow U[i]\) & \(\mathrm{j}_{\mathrm{i}} \leftarrow \mathrm{X}[\mathrm{i}]\) \\
\hline 2 & & \\
\hline
\end{tabular}

\begin{tabular}{|l|l|l|}
\hline 0 & & \\
\hline \(1=4\) & \(f_{i-1} \leftarrow U[i-1]\) & \(j_{i} \leftarrow X[i]\) \\
\hline 2 & & \\
\hline
\end{tabular}
\begin{tabular}{|l|l|l|}
\hline 0 & \(\mathrm{f}_{\mathrm{i}} \leftarrow U[\mathrm{i}]\) & \\
\hline 1 & & \(\mathrm{j}_{\mathrm{i}} \leftarrow \mathrm{X}[\mathrm{i}]\) \\
\hline 2 & & \\
\hline
\end{tabular}
\begin{tabular}{|l|l|l|}
\hline 0 & & \\
\hline 1 & & \(\mathrm{j}_{\mathrm{i}} \leftarrow X[\mathrm{i}]\) \\
\hline 2 & \(\mathrm{f}_{\mathrm{i}} \leftarrow U[\mathrm{i}]\) & \\
\hline
\end{tabular}
\begin{tabular}{|l|l|l|}
\hline \(0=3\) & \(\mathrm{f}_{\mathrm{i}-1} \leftarrow U[\mathrm{i}-1]\) & \\
\hline 1 & & \(\mathrm{j}_{\mathrm{i}} \leftarrow \mathrm{X}[\mathrm{i}]\) \\
\hline 2 & & \\
\hline 0 & & \\
\hline 1 & & \(\mathrm{j}_{\mathrm{i}} \leftarrow \mathrm{X}[\mathrm{i}]\) \\
\hline \(2=-1\) & \(\mathrm{f}_{\mathrm{i}+1} \leftarrow U[\mathrm{i}+1]\) & \\
\hline
\end{tabular}
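
The observation amounts to checking resource usage per slot `t mod Δ` rather than per absolute cycle. A minimal sketch, assuming a single load unit as in the example; the function name `load_conflict` and its parameters are illustrative, not part of the slides:

```python
from collections import Counter


def load_conflict(issue_cycles, delta, loads_per_cycle=1):
    """Return True if too many loads fall into the same slot modulo delta.

    issue_cycles maps each load instruction to the cycle it is issued in,
    relative to the start of its own iteration (may exceed delta or be < 0).
    """
    slots = Counter(t % delta for t in issue_cycles.values())
    return any(n > loads_per_cycle for n in slots.values())


if __name__ == "__main__":
    delta = 3
    for f_cycle in (1, 4, 0, 3, 2, -1):
        print(f_cycle, load_conflict({"f": f_cycle, "j": 1}, delta))
```

With j fixed at cycle 1, placing f at cycle 1 or 4 conflicts (both are slot 1), whereas cycles 0, 3, 2 and -1 do not. Python's `%` maps -1 to slot 2, matching the "2 = -1" row of the last table.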

\section*{Modulo scheduling}

Interaction with register allocation:
- delaying an instruction \(d: z \leftarrow x \text{ op } y\)
  - extends the liveness ranges of d's uses, namely \(x\) and \(y\); these may overlap with other (iteration-count-indexed) versions of \(z\), so multiple copies may need to be maintained, as in the previous example
  - shortens the liveness range of the def(s) of d, namely \(z\), to its uses; a range \(< 1\) is illegal, i.e. the uses need to be postponed, too
- similarly, scheduling an instruction earlier shortens the liveness ranges of its uses and extends the liveness range of its defs
- hence, scheduling affects liveness/register allocation

\section*{Modulo scheduling: estimating \(\Delta_{\text {min }}\)}

Identification of \(\Delta_{\min }\) as the maximum of the following:
- resource estimator: for each FU type
  - calculate requested cycles: add cycle requests of all instructions mapped to that FU type
  - divide request by number of instances of the FU type
  - max over all FU types is a lower bound on \(\Delta_{\min }\)
- data-dependence estimator: the largest sum of latencies along any simple cycle through the data dependence graph

Example: 1 ALU, 1 MEM; both issue 1 instruction/cycle; instr. latency 1 cycle

(MEM instructions in box)

Data dependence estimator: \(3\) (\(\mathrm{c} \rightarrow \mathrm{d} \rightarrow \mathrm{e} \rightarrow \mathrm{c}\))

ALU-estimator: 5 instrs, 1 cycle each, 1 ALU \(\rightarrow 5\)

MEM-estimator: 4 instrs, 1 cycle each, 1 MEM \(\rightarrow 4\)
\[
\text { Hence } \Delta_{\text {min }}=5
\]
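
The two estimators can be sketched for this example as follows. This is an illustration under stated assumptions: the edge list is read off the loop body (each edge carries the number of iterations it crosses), and all names (`EDGES`, `FU_OF`, `resource_bound`, `dependence_bound`) are hypothetical. Dividing the latency sum of a cycle by the number of iterations it spans goes slightly beyond the slide; in this example every cycle spans exactly one iteration, so it makes no difference.

```python
from math import ceil

# (src, dst, distance): dst of iteration i+distance reads src of iteration i.
EDGES = [("j", "a", 1), ("b", "a", 1), ("a", "b", 0), ("f", "b", 1),
         ("e", "c", 1), ("j", "c", 1), ("f", "d", 1), ("c", "d", 0),
         ("b", "e", 0), ("d", "e", 0), ("b", "g", 0), ("d", "h", 0)]
FU_OF = {"a": "ALU", "b": "ALU", "c": "ALU", "d": "ALU", "e": "ALU",
         "f": "MEM", "g": "MEM", "h": "MEM", "j": "MEM"}
UNITS = {"ALU": 1, "MEM": 1}


def resource_bound(fu_of, units, cycles_per_instr=1):
    """Max over FU types of (requested cycles / number of units)."""
    demand = {}
    for fu in fu_of.values():
        demand[fu] = demand.get(fu, 0) + cycles_per_instr
    return max(ceil(demand[fu] / units[fu]) for fu in demand)


def dependence_bound(edges, latency=1):
    """Max over simple cycles of (sum of latencies / iterations spanned).

    Assumes every dependence cycle crosses at least one iteration boundary.
    """
    succ = {}
    for u, v, k in edges:
        succ.setdefault(u, []).append((v, k))
    best = 0

    def walk(start, node, lat, dist, seen):
        nonlocal best
        for v, k in succ.get(node, []):
            if v == start:
                best = max(best, ceil((lat + latency) / (dist + k)))
            elif v not in seen:
                walk(start, v, lat + latency, dist + k, seen | {v})

    for n in succ:
        walk(n, n, 0, 0, {n})
    return best


def delta_min():
    return max(resource_bound(FU_OF, UNITS), dependence_bound(EDGES))


if __name__ == "__main__":
    print(resource_bound(FU_OF, UNITS), dependence_bound(EDGES), delta_min())
```

The exhaustive cycle search is only meant for graphs of this size.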

\section*{Modulo scheduling: priority of instructions}

Algorithm schedules instructions according to priorities
Possible metrics:
- membership in data dependence cycle of max latency
- execution on FU type that's most heavily used (resource estimate)

Example: [c, d, e, a, b, f, j, g, h]


\section*{Modulo scheduling: sketch of algorithm}

Main data structures:
- array SchedTime, assigning to each instruction a cycle time
- table ResourceMap, assigning to each FU and cycle (modulo \(\Delta\)) the instruction that occupies it
\begin{tabular}{|c|c|}
\hline Instr 1 & 8 \\
\hline Instr 2 & 4 \\
\hline Instr 3 & 0 \\
\hline\(:\) & \(:\) \\
\hline
\end{tabular}
\begin{tabular}{|c|c|c|}
\hline & FU1 & FU2 \\
\hline 0 & Instr 1 & Instr 4 \\
\hline 1 & Instr 2 & \\
\hline 2 & & Instr 3 \\
\hline\(:\) & \(:\) & \(:\) \\
\hline
\end{tabular}
- pick highest-priority instruction that's not yet scheduled: \(i\)
- schedule \(i\) at the earliest cycle that
  - respects the data dependencies w.r.t. the already scheduled instructions
  - has the right FU for \(i\) available
- if \(i\) can't be scheduled for the current \(\Delta\), place \(i\) without respecting the resource constraint: evict the current inhabitant and/or the data-dependence successors of \(i\) that are now scheduled too early. Evictees need to be scheduled again.
  - in principle, evictions could go on forever
  - define a cut-off (heuristic) at which point \(\Delta\) is increased
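
The steps above can be sketched as follows. This is one possible reading of the slide, not the book's algorithm verbatim: it respects only already-scheduled predecessors when choosing a cycle and then evicts successors that end up too early. The step budget of 100, the uniform latency of 1, and all identifiers (`EDGES`, `FU`, `PRIORITY`, `modulo_schedule`, `schedule_loop`) are assumptions for illustration.

```python
# (src, dst, distance): dst of iteration i+distance reads src of iteration i.
EDGES = [("j", "a", 1), ("b", "a", 1), ("a", "b", 0), ("f", "b", 1),
         ("e", "c", 1), ("j", "c", 1), ("f", "d", 1), ("c", "d", 0),
         ("b", "e", 0), ("d", "e", 0), ("b", "g", 0), ("d", "h", 0)]
FU = {"a": "ALU", "b": "ALU", "c": "ALU", "d": "ALU", "e": "ALU",
      "f": "MEM", "g": "MEM", "h": "MEM", "j": "MEM"}
PRIORITY = ["c", "d", "e", "a", "b", "f", "j", "g", "h"]
LATENCY = 1


def modulo_schedule(delta, budget=100):
    """Try to schedule the loop body in delta cycles; None if budget runs out."""
    sched_time = {}      # instruction -> cycle
    resource_map = {}    # (FU type, cycle mod delta) -> instruction

    def evict(instr):
        del resource_map[(FU[instr], sched_time[instr] % delta)]
        del sched_time[instr]

    for _ in range(budget):
        todo = [x for x in PRIORITY if x not in sched_time]
        if not todo:
            return sched_time
        i = todo[0]  # highest-priority unscheduled instruction
        # Earliest cycle allowed by the already scheduled predecessors.
        earliest = max([0] + [sched_time[p] + LATENCY - k * delta
                              for p, s, k in EDGES
                              if s == i and p in sched_time])
        # Delta consecutive cycles cover every slot of the FU exactly once.
        free = [t for t in range(earliest, earliest + delta)
                if (FU[i], t % delta) not in resource_map]
        t = free[0] if free else earliest
        occupant = resource_map.get((FU[i], t % delta))
        if occupant is not None:
            evict(occupant)  # no free slot: displace the current inhabitant
        sched_time[i] = t
        resource_map[(FU[i], t % delta)] = i
        # Evict successors that are now scheduled too early.
        for p, s, k in EDGES:
            if p == i and s in sched_time and \
                    sched_time[s] + k * delta < t + LATENCY:
                evict(s)
    return None  # cut-off reached: caller should increase delta


def schedule_loop(delta_min, max_delta=50):
    """Try delta = delta_min, delta_min + 1, ... until a schedule is found."""
    for delta in range(delta_min, max_delta + 1):
        result = modulo_schedule(delta)
        if result is not None:
            return delta, result
    return None


if __name__ == "__main__":
    print(schedule_loop(5))
```

Traced by hand for \(\Delta=5\) and the priority list from the previous slide, this sketch performs the same placements and evictions as the worked example that follows.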

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline\(a\) & \\
\hline\(b\) & \\
\hline c & \\
\hline d & \\
\hline e & \\
\hline f & \\
\hline g & \\
\hline\(h\) & \\
\hline\(j\) & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: c
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 0

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline\(a\) & \\
\hline\(b\) & \\
\hline c & 0 \\
\hline d & \\
\hline e & \\
\hline f & \\
\hline g & \\
\hline\(h\) & \\
\hline\(j\) & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: c
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 0
- so schedule c in cycle 0

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline\(a\) & 3 \\
\hline\(b\) & \\
\hline\(c\) & 0 \\
\hline\(d\) & 1 \\
\hline\(e\) & 2 \\
\hline\(f\) & \\
\hline\(g\) & \\
\hline\(h\) & \\
\hline\(j\) & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: d
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 1
- so schedule d in cycle 1

Similarly: \(e \rightarrow 2\), \(a \rightarrow 3\). Next instruction: \(b\).

Earliest cycle in which ALU is available: 4. But: b's successor e is scheduled in (earlier) cycle 2! Hence: place b in cycle 4, but evict e.

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline\(b\) & 4 \\
\hline c & 0 \\
\hline d & 1 \\
\hline e & (evicted; was 2) \\
\hline f & \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: d
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 1
- so schedule d in cycle 1

Similarly: \(e \rightarrow 2\), \(a \rightarrow 3\). Next instruction: \(b\).

Earliest cycle in which ALU is available: 4. But: b's successor e is scheduled in (earlier) cycle 2! Hence: place b in cycle 4, but evict e.

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline b & 4 \\
\hline c & 0 \\
\hline d & 1 \\
\hline e & (evicted; was 2) \\
\hline f & \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: e
- ALU-slot for e: 2 (again)
- But: data dependence \(e \rightarrow c\) violated: with e in cycle 7, the next iteration's c (cycle \(0+\Delta=5\)) would run before e has finished. Yes, cross-iteration deps count!

So: schedule e in cycle \(7 (=2 \bmod \Delta)\), but evict c; see next slide...

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline b & 4 \\
\hline c & (evicted; was 0) \\
\hline d & 1 \\
\hline e & 7 (was 2) \\
\hline f & \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: c
- ALU-slot for c: 0 (again)
- But: data dependence \(c \rightarrow d\) violated

So: schedule c in cycle \(5 (=0 \bmod \Delta)\), but evict d; see next slide...

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline b & 4 \\
\hline c & 5 (was 0) \\
\hline d & (evicted; was 1) \\
\hline e & 7 (was 2) \\
\hline f & \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: d
- ALU-slot for d: 1 (again)
- Hooray - data dependence \(d \rightarrow e\) respected

So: schedule d in cycle \(6 (=1 \bmod \Delta)\). No eviction; see next slide...

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline\(b\) & 4 \\
\hline c & 5 (was 0) \\
\hline d & 6 (was 1) \\
\hline e & 7 (was 2) \\
\hline f & \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: \(f\)
- MEM-slot for f: 0; no data-deps, so schedule f:0

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline b & 4 \\
\hline c & 5 (was 0) \\
\hline d & 6 (was 1) \\
\hline e & 7 (was 2) \\
\hline f & 0 \\
\hline g & \\
\hline h & \\
\hline j & \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: \(f\)
- MEM-slot for f: 0 ; no data-deps, so schedule f:0
- highest-priority, unscheduled instruction: j
- MEM-slot for j: 1; no data-deps, so schedule j:1
- highest-priority, unscheduled instruction: g
- MEM-slot for g: 2; the earliest cycle \(c=2+k \cdot \Delta\) where data-dep \(b \rightarrow g\) is respected is 7.

So schedule g:7; see next slide...

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline b & 4 \\
\hline c & 5 (was 0) \\
\hline d & 6 (was 1) \\
\hline e & 7 (was 2) \\
\hline f & 0 \\
\hline g & 7 \\
\hline h & \\
\hline j & 1 \\
\hline
\end{tabular}
- highest-priority, unscheduled instruction: \(h\)
- MEM-slot for h: 3; the earliest cycle \(c=3+k \cdot \Delta\) where data-dep \(d \rightarrow h\) is respected is 8.

So schedule h:8; final schedule on next slide.

\section*{Modulo scheduling: example}

\begin{tabular}{|l|l|}
\hline a & 3 \\
\hline\(b\) & 4 \\
\hline c & 5 \\
\hline d & 6 \\
\hline e & 7 \\
\hline f & 0 \\
\hline g & 7 \\
\hline\(h\) & 8 \\
\hline j & 1 \\
\hline
\end{tabular}

Instructions c, d, e, g, h are scheduled 1 iteration off.
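
The final schedule can be checked mechanically. The sketch below is an illustration under assumptions (latency 1 for every instruction, one ALU and one MEM unit, edge list read off the loop body; the names `check` and `FINAL` are made up): it tests every dependence edge and every FU slot modulo \(\Delta\), and derives which instructions run one iteration off.

```python
# (src, dst, distance): dst of iteration i+distance reads src of iteration i.
EDGES = [("j", "a", 1), ("b", "a", 1), ("a", "b", 0), ("f", "b", 1),
         ("e", "c", 1), ("j", "c", 1), ("f", "d", 1), ("c", "d", 0),
         ("b", "e", 0), ("d", "e", 0), ("b", "g", 0), ("d", "h", 0)]
FU = {"a": "ALU", "b": "ALU", "c": "ALU", "d": "ALU", "e": "ALU",
      "f": "MEM", "g": "MEM", "h": "MEM", "j": "MEM"}
FINAL = {"a": 3, "b": 4, "c": 5, "d": 6, "e": 7,
         "f": 0, "g": 7, "h": 8, "j": 1}


def check(schedule, delta, latency=1):
    """Return a list of violated dependence and resource constraints."""
    problems = []
    for src, dst, dist in EDGES:
        # dst of iteration i+dist issues at schedule[dst] + dist*delta
        # relative to the start of iteration i.
        if schedule[dst] + dist * delta < schedule[src] + latency:
            problems.append(("dependence", src, dst))
    used = {}
    for instr, t in schedule.items():
        key = (FU[instr], t % delta)
        if key in used:
            problems.append(("resource", used[key], instr))
        used[key] = instr
    return problems


if __name__ == "__main__":
    print(check(FINAL, 5))
    # Instructions whose cycle is >= delta belong to the next loop stage.
    print(sorted(x for x, t in FINAL.items() if t // 5 == 1))
```

For comparison, moving e back to cycle 2 while b stays in cycle 4 makes `check` report the violated dependence \(b \rightarrow e\), which is the reason for the first eviction in the example.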

\section*{Summary of scheduling}

Challenges arise from interaction between
- program properties: data dependencies (RAW, WAR, WAW) and control dependencies
- hardware constraints (FU availability, latencies, ...)

Optimal solutions typically infeasible \(\rightarrow\) heuristics

Scheduling within a basic block (local): list scheduling


Scheduling across basic blocks (global): trace scheduling

Loop scheduling: SW pipelining, modulo scheduling
```

