# **Topic 14: Scheduling**

# COS 320

# **Compiling Techniques**

Princeton University Spring 2016

Lennart Beringer



#### The Back End:

1. Maps infinite number of virtual registers to finite number of real registers $\rightarrow$ register allocation
2. Removes inefficiencies introduced by front-end $\rightarrow$ optimizer
3. Removes inefficiencies introduced by programmer $\rightarrow$ optimizer
4. Adjusts pseudo-assembly composition and order to match target machine $\rightarrow$ *scheduler*

#### Starting point

- 1 r1 = r0 + 0
- 2 r2 = M[FP + A]
- 3 r3 = r0 + 4
- 4 r4 = M[FP + X]

LOOP:

- 1 r5 = r3 \* r1
- 2
- 3 r5 = r2 + r5
- 4 M[r5] = r4
- 5 r1 = r1 + 1
- 6 BR r1 <= 10, LOOP

Multiplication takes 2 cycles

#### Instructions take multiple cycles: fill empty slots with independent instructions!

| Before scheduling | After scheduling |
|-------------------|------------------|
| 1 r1 = r0 + 0<br>2 r2 = M[FP + A]<br>3 r3 = r0 + 4<br>4 r4 = M[FP + X] | 1 r1 = r0 + 0<br>2 r2 = M[FP + A]<br>3 r3 = r0 + 4<br>4 r4 = M[FP + X] |
| LOOP:<br>1 r5 = r3 * r1<br>2 *(empty slot)*<br>3 r5 = r2 + r5<br>4 M[r5] = r4<br>5 r1 = r1 + 1<br>6 BR r1 <= 10, LOOP | LOOP:<br>1 r5 = r3 * r1<br>2 r1 = r1 + 1<br>3 r5 = r2 + r5<br>4 M[r5] = r4<br>5 BR r1 <= 10, LOOP |
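The effect of filling the empty slot can be checked with a small simulation. This is only a sketch under simplifying assumptions that are not in the slides: a single-issue, in-order machine that stalls until operands are ready, a 2-cycle multiply, and 1 cycle for everything else. The function name `issue_cycles` and the tuple encoding are made up for illustration.

```python
def issue_cycles(block):
    """Return the issue cycle of each instruction on a single-issue, in-order machine.

    block is a list of (dest, sources, latency) triples; dest may be None.
    """
    ready = {}    # register -> first cycle in which its new value can be read
    cycle = 0
    cycles = []
    for dest, sources, latency in block:
        # Issue no earlier than the next cycle, and stall until all operands are ready.
        cycle = max([cycle + 1] + [ready.get(src, 0) for src in sources])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + latency
    return cycles


# Loop body from the example above (assumed latencies: multiply 2, others 1).
MUL = ("r5", ["r3", "r1"], 2)    # r5 = r3 * r1
ADD5 = ("r5", ["r2", "r5"], 1)   # r5 = r2 + r5
STORE = (None, ["r5", "r4"], 1)  # M[r5] = r4
INC = ("r1", ["r1"], 1)          # r1 = r1 + 1
BR = (None, ["r1"], 1)           # BR r1 <= 10, LOOP

original = issue_cycles([MUL, ADD5, STORE, INC, BR])
reordered = issue_cycles([MUL, INC, ADD5, STORE, BR])
print(original[-1], reordered[-1])
```

Under these assumptions the branch issues in cycle 6 in the original order and in cycle 5 after the increment is moved into the multiply's shadow.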

What exactly do we mean by "independent"?



## Motivating example

When our processor can execute 2 instructions per cycle

- 1 r1 = r0 + 0
- 2 r2 = M[FP + A]
- 3 r3 = r0 + 4
- 4 r4 = M[FP + X]

LOOP:

- 1 r5 = r3 \* r1
- 2 r1 = r1 + 1
- 3 r5 = r2 + r5
- 4 M[r5] = r4
- 5 BR r1 <= 10, LOOP

Issue pairs of independent instructions whenever possible:

| Cycle | Slot 1       | Slot 2             |
|-------|--------------|--------------------|
| 1     | r1 = r0 + 0  | r2 = M[FP + A]     |
| 2     | r3 = r0 + 4  | r4 = M[FP + X]     |
| LOOP: |              |                    |
| 1     | r5 = r3 * r1 | r1 = r1 + 1        |
| 2     |              |                    |
| 3     | r5 = r2 + r5 |                    |
| 4     | M[r5] = r4   | BR r1 <= 10, LOOP  |


## **Instruction Level Parallelism**

- Instruction-Level Parallelism (ILP) is the concurrent execution of independent assembly instructions. The concurrently executed instructions stem from a single program.
- ILP is a cost-effective way to extract performance from programs.
- Exploiting ILP requires global optimization and scheduling.
- Processors can execute several instructions per cycle (Itanium: up to 6).
- ILP/VLIW: dependencies identified by compiler $\rightarrow$ instruction bundles
- Super-Scalar: dependencies identified by processor (instruction windows)

Advantages / Disadvantages?

Possible synthesis:

- have compiler take care of register-carried dependencies
- let processor take care of memory-carried dependencies: exploit dynamic resolution of memory aliasing
- use register renaming, register bypassing, out-of-order execution, speculation (branch prediction) to keep all execution units busy

- Data dependencies
  - ordering between instructions that arises from the flow of data
- Control dependencies
  - ordering between instructions that arises from the flow of control

### Resource constraints

- processors have limited number of functional units
- not all functional units can execute all instructions (Floating point unit versus Integer-ALU, ...)
- only limited number of instructions can be issued in one cycle
- only a limited number of register read/writes can be done concurrently

- A *data dependence* is a constraint on scheduling arising from the flow of data between two instructions. Types:
  - RAW: An instruction u is *flow-dependent* on a preceding instruction d if u consumes a value computed by d.
  - WAR: An instruction d is *anti-dependent* on a preceding instruction u if d writes to a location read by u.
  - WAW: An instruction  $d_2$  is *output-dependent* on a preceding instruction  $d_1$  if  $d_1$  writes to a location also written by  $d_2$ .
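The three definitions can be written down as set intersections. This is a minimal sketch; the function name `classify` and the encoding of an instruction as a `(defs, uses)` pair of location sets are assumptions made for illustration.

```python
def classify(first, second):
    """Dependences of `second` on the preceding instruction `first`.

    Each instruction is a (defs, uses) pair of sets of locations.
    """
    defs1, uses1 = first
    defs2, uses2 = second
    kinds = set()
    if defs1 & uses2:
        kinds.add("RAW")   # second reads what first wrote: flow dependence
    if uses1 & defs2:
        kinds.add("WAR")   # second overwrites what first read: anti dependence
    if defs1 & defs2:
        kinds.add("WAW")   # both write the same location: output dependence
    return kinds


# Instructions from the loop body of the motivating example.
mul = ({"r5"}, {"r3", "r1"})   # r5 = r3 * r1
inc = ({"r1"}, {"r1"})         # r1 = r1 + 1
add = ({"r5"}, {"r2", "r5"})   # r5 = r2 + r5

print(classify(mul, inc), classify(mul, add), classify(inc, add))
```

In this encoding, moving `r1 = r1 + 1` next to the multiply is legal only as long as it stays after the multiply, because of the WAR dependence on `r1`.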



"False"/"name" dependences: arise from reuse of location; can often be avoided by (dynamic) renaming

## **Eliminating false dependencies**

WAW and WAR dependencies can often be eliminated by register renaming, at the cost of adding registers.
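A rough sketch of renaming within a straight-line block: every definition gets a fresh register and later uses are redirected to it. The function name `rename`, the fresh names `t0, t1, ...`, and the assumption of an unlimited supply of fresh registers are all illustrative choices, not part of the slides.

```python
import itertools


def rename(block):
    """Give every definition in a straight-line block a fresh register.

    block is a list of (dest, sources) pairs. Returns the renamed block and
    a map from original register to the fresh register holding its final value.
    """
    fresh = (f"t{n}" for n in itertools.count())   # hypothetical fresh names
    current = {}                                   # original name -> latest fresh name
    renamed = []
    for dest, sources in block:
        # Sources are looked up before the destination is renamed,
        # so an instruction like r1 = r1 + 1 reads the old value.
        new_sources = [current.get(src, src) for src in sources]
        new_dest = next(fresh)
        current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed, current


block = [
    ("r1", ["r2", "r3"]),   # r1 = r2 op r3
    ("r2", ["r4", "r5"]),   # WAR on r2 with the first instruction
    ("r1", ["r2", "r6"]),   # WAW on r1 with the first, RAW on r2 with the second
]
renamed, final_names = rename(block)
print(renamed, final_names)
```

After renaming, only the RAW dependence between the second and third instruction remains; code following the block has to read the final values through `final_names`.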

### Eliminating false dependences

WAR dependencies can often be replaced by RAW dependencies, at the price of using yet another register and a (move) instruction.

## **Control Dependence**

Node y is control dependent on x if

- **x** is a branch, with successors **u**, **v**
- y post-dominates u in the CFG: each path from u to EXIT includes y
- y does not post-dominate v in the CFG: there is a path from v to EXIT that avoids y

Schedule must respect control dependences: don't move instructions past their control dependence ancestors!
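The definition can be tested directly once post-dominator sets are known. The following is a sketch, assuming a CFG given as a successor map in which every node other than EXIT has at least one successor and can reach EXIT; the function names and the tiny if-then CFG are made up for illustration.

```python
def post_dominators(succ, exit_node):
    """Iteratively compute, for each node n, the set of nodes that post-dominate n."""
    nodes = set(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # n is post-dominated by itself and by whatever post-dominates all successors.
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependent(y, x, succ, pdom):
    """True if x has successors u, v such that y post-dominates u but not v."""
    return any(
        y in pdom[u] and y not in pdom[v]
        for u in succ[x]
        for v in succ[x]
        if u != v
    )


# Hypothetical CFG: x branches either to a or directly to join (if-then without else).
succ = {"x": ["a", "join"], "a": ["join"], "join": ["exit"], "exit": []}
pdom = post_dominators(succ, "exit")
print(control_dependent("a", "x", succ, pdom), control_dependent("join", "x", succ, pdom))
```

Here `a` is control dependent on the branch `x`, while `join` is not, since `join` is reached on every path from `x` to EXIT.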



### Dependences

#### Latency

- Amount of time after the execution of an instruction that its result is ready.
- An instruction can have more than one latency, e.g. a load, depending on cache hit/miss.

#### Data Dependence Graph

- A *data dependence graph* consists of instructions and a set of directed data dependence edges among them in which each edge is labeled with its latency and type of dependence.
- Scheduling (code motion) must respect dependence graph.

Program dependence graph: overlay of data dependence graph with control dependencies (two kinds of edges)

#### Machines can also do scheduling...

- hardware schedulers process code after it has been fetched
- hardware finds independent instructions
- works with legacy architectures (found in x86 / Pentium)
- program knowledge more precise at run-time
  - memory dependence
  - control flow resolved

#### But compiler still important.

- Hardware schedulers have a small window.
- Hardware complexity increases.
- Hardware does not benefit directly from compiler optimization.

# **RISC-style processor pipeline**



#### Modern processors:

- many more stages (up to 20-30)
- different stages take different number of cycles per instruction
- some (components of) stages duplicated, eg super-scalar

#### Common characteristics: resource constraints

- each stage can only hold a fixed number of instructions per cycle
- but: instructions can be in-flight concurrently (pipelining: more later)
- register bank can only serve small number of reads/writes per cycle

# Goal of scheduling

Construct a sorted version of the dependence graph that

- produces the same result as the sequential program: respect dependencies, latencies
- obeys the resource constraints
- minimizes execution time (other metrics possible)

Solution formulated as a table that indicates the issue cycle of each instruction:

| Cycle | Resource 1 | Resource 2 | ... | Resource n |
|-------|------------|------------|-----|------------|
| 1     | 1          |            |     | 2          |
| 2     |            | 3          |     | 4          |
| 3     |            |            |     |            |
| ...   |            |            |     |            |

Even simplified versions of the scheduling problem are typically NP-hard $\rightarrow$ heuristics

# A classification of scheduling heuristics

Schedule within a basic block (local)

- instructions cannot move past basic block boundaries
- schedule covers only one basic block

Example technique: (priority) list scheduling



### Loop scheduling

- instructions cannot move past basic block boundaries
- each schedule covers body of a loop
- exploits/reflects pipeline structure of modern processors

Example technique: SW pipelining, modulo scheduling

# Local scheduling: list scheduling

Advantage: can disregard control dependencies

**Input**:

- data dependence graph of straight-line code, annotated with (conservative) latencies
- instruction forms annotated with suitable type of functional units
- number of available functional units of each type



| Integer-ALU | FP | MEM |
|-------------|----|-----|
| 2           | 1  | 1   |

**Output**: cycle-accurate assignment of instructions to functional units

| Cycle | ALU1 | ALU2 | FP | MEM |
|-------|------|------|----|-----|
| 1     | 1    |      |    | 2   |
| 2     |      |      |    |     |
| 3     |      |      |    |     |
| 4     |      | 3    |    | 4   |
| 5     |      |      |    |     |
| 6     |      |      | 5  | 6   |

Can be refined for pipelined architectures, where latency != reservation period for FU

# List scheduling: algorithm

1. Insert nodes that have no predecessors into queue
2. Start with cycle count c=1
3. While queue not empty:
   - select an instruction **i** from the queue such that all predecessors were scheduled "sufficiently long ago" (latency information)
     - priority: e.g. length of path to EXIT, maybe weighted by latency of RAW (+WAW/WAR?) deps
   - if a functional unit **u** for **i** is available:
     - insert **i** in (**c**, **u**), and remove it from the **queue**
     - insert any successor of **i** into queue for which all predecessors have now been scheduled
   - if no functional unit is available for **i**, select another instruction
   - if no instruction from the queue was scheduled, increment **c**

Variation:

- start at nodes without successors and cycle count LAST
- work upwards, entering finish times of instructions in table
- availability of FUs still governed by start times
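The forward variant of the algorithm might be sketched as follows. Several details are assumptions rather than part of the slides: the dictionary-based encoding of the dependence graph, the priority (latency-weighted longest path to the end of the block), the simplification that the cycle counter advances after every queued instruction has been tried once, and the latencies chosen for the example (0 for the WAR edge and for keeping the branch last). Every functional-unit type that is used is assumed to have at least one unit.

```python
def list_schedule(instrs, edges, units):
    """Greedy list scheduling of one basic block.

    instrs: {name: functional-unit type}
    edges:  {(pred, succ): latency}, the data dependence graph
    units:  {functional-unit type: number of units}
    Returns {name: (issue cycle, functional-unit type)}.
    """
    preds = {i: [] for i in instrs}
    succs = {i: [] for i in instrs}
    for (p, s), lat in edges.items():
        preds[s].append((p, lat))
        succs[p].append((s, lat))

    priority = {}

    def path_length(i):
        # Latency-weighted length of the longest path from i to the block end.
        if i not in priority:
            priority[i] = max((lat + path_length(s) for s, lat in succs[i]), default=0)
        return priority[i]

    queue = [i for i in instrs if not preds[i]]
    schedule = {}
    c = 1
    while queue:
        free = dict(units)  # units still unused in cycle c
        for i in sorted(queue, key=path_length, reverse=True):
            # All predecessors must have been issued "sufficiently long ago".
            ready = all(schedule[p][0] + lat <= c for p, lat in preds[i])
            if ready and free[instrs[i]] > 0:
                free[instrs[i]] -= 1
                schedule[i] = (c, instrs[i])
                queue.remove(i)
                for s, _ in succs[i]:
                    if s not in queue and all(p in schedule for p, _ in preds[s]):
                        queue.append(s)
        c += 1
    return schedule


# Loop body of the motivating example, with one ALU and one memory unit (assumed).
instrs = {"mul": "ALU", "add": "ALU", "store": "MEM", "inc": "ALU", "br": "ALU"}
edges = {
    ("mul", "add"): 2,    # RAW on r5, multiply takes 2 cycles
    ("add", "store"): 1,  # RAW on r5
    ("mul", "inc"): 0,    # WAR on r1 (assumed latency 0)
    ("inc", "br"): 1,     # RAW on r1
    ("store", "br"): 0,   # keep the branch last (assumed latency 0)
}
schedule = list_schedule(instrs, edges, {"ALU": 1, "MEM": 1})
print(schedule)
```

With these inputs the increment lands in the cycle left empty by the multiply, reproducing the 5-cycle loop body of the motivating example.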

Observation: individual basic blocks often don't have much ILP

- speed-up limited
- many slots in list schedule remain empty: poor resource utilization
- problem is accentuated by deep pipelines, where many instructions could be concurrently in-flight

Q: How can we extend scheduling to many basic blocks?

A: By considering sets of basic blocks that are often executed together

- select instructions along frequently executed traces (a trace is an acyclic path through the CFG), identified e.g. by profiling: counting the traversals of each CFG edge
- schedule trace members using list scheduling
- adjust off-trace code to deal with executions that only traverse parts of the trace
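Trace selection from edge profiles can be sketched as a greedy walk. This is one simple heuristic chosen for illustration, not necessarily the selection rule of the cited paper; the function name `select_trace`, the edge-count dictionary, and the block names are hypothetical.

```python
def select_trace(start, edge_counts, visited):
    """Grow a trace from `start` by repeatedly following the most frequent
    outgoing edge to a block that is not yet part of any trace.

    edge_counts: {(src, dst): number of profiled traversals}
    visited:     set of blocks already assigned to a trace (updated in place)
    """
    trace = [start]
    visited.add(start)
    node = start
    while True:
        candidates = [
            (count, dst)
            for (src, dst), count in edge_counts.items()
            if src == node and dst not in visited
        ]
        if not candidates:
            return trace          # no unvisited successor: the trace ends here
        _, node = max(candidates)  # hottest edge wins
        trace.append(node)
        visited.add(node)          # never revisiting a block keeps the trace acyclic


# Hypothetical profile: the path B1 -> B2 -> B3 is hot, the side exit to A is cold.
edge_counts = {("B1", "B2"): 90, ("B1", "A"): 10, ("B2", "B3"): 90, ("A", "B3"): 10}
visited = set()
hot = select_trace("B1", edge_counts, visited)
cold = select_trace("A", edge_counts, visited)
print(hot, cold)
```

Blocks left over after the hot trace, such as `A` here, form further traces that are scheduled afterwards.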

#### Details:

Joseph A. Fisher: Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Computers 30(7): 478-490 (1981)



Figure: a trace **t**, and its neighbors.

1. Construct the data dependence graph of the instructions on the trace, but consider the live-ins of **A** to be read by **b**. This prevents those instructions in **B2**, **B3**, **B4** that **define** variables that are **use**d in **A** from being moved **up** past **b**, by creating a WAR dependence.
2. (List-)schedule the instructions in **t**.
3. Adjust the code outside of **t**.



## Trace scheduling: compensation code S



In step 2, some instructions in **B1** end up above s in **B**, others **below**.

**Copy** the latter ones into the edge  $s \rightarrow A$ , into a new block S so that they're executed when control flow follows  $B1 \rightarrow s \rightarrow A$ , but not when A is entered through a different edge.





## Trace scheduling: adjust code jumping to j



In step 2, some instructions in **B2** end up **above** j in **B**, others below.

Adjust the jump in C to point to the first instruction (bundle) following the last instruction in B that stems from B2 – call the new jump target j'. Thus yellow instructions remain non-executed if control enters B from C: all instructions from B2 are above j'.



Note: if there's no yellow instruction, we're in fact adjusting j upwards: j' follows the last purple instruction.

Next, some instructions from B3/B4 end up above j' in B, others below. Copy the former ones into the edge  $C \rightarrow j'$ , into a new block J, ensuring that instructions following j' receive correct data when flow enters B via C.





## Trace scheduling: cleaning up S and J



Final cleanup: some instructions in **S** and **J** may be dead – eliminate them. Then, **S** and **J** can be (list-)scheduled or be part of the next trace.

# Pipelining

#### Purely sequential execution:



Pipelining - can partially overlap instructions:

| FETCH | DECODE | EXECUTE | MEM     | WRITE   |       |       |
|-------|--------|---------|---------|---------|-------|-------|
|       | FETCH  | DECODE  | EXECUTE | MEM     | WRITE |       |
|       |        | FETCH   | DECODE  | EXECUTE | MEM   | WRITE |

One instruction issued (and retired) each cycle: speedup ≈ pipeline depth, assuming that

- each instruction spends one cycle in each stage
- all instruction forms visit the same (sequence of) FUs
- there are no (data) dependencies
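The speedup estimate can be made precise with a short calculation. The symbols $k$ (number of pipeline stages) and $n$ (number of instructions) are introduced here for illustration and do not appear in the slides; the idealizing assumptions are the three listed above.

$$
T_{\text{seq}} = n \cdot k, \qquad
T_{\text{pipe}} = k + (n - 1), \qquad
\text{speedup} = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n k}{k + n - 1} \;\longrightarrow\; k \quad (n \to \infty)
$$

For the three instructions and five stages shown in the diagram, this gives $15$ cycles sequentially versus $5 + 2 = 7$ cycles pipelined.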

## Pipelining for realistic processors

Different instructions visit different sets/sequences of functional units, and occasionally multiple types of functional units in the same cycle:

Example: floating point instructions on MIPS R4000 (ADD, MUL, CONV)

| Instruction | 1     | 2    | 3      | 4          | 5          | 6            | 7     | 8          | 9     | 10    |
|-------------|-------|------|--------|------------|------------|--------------|-------|------------|-------|-------|
| ADD         | FETCH | READ | UNPACK | SHIFT, ADD | ROUND, ADD | ROUND, SHIFT | WRITE |            |       |       |
| MUL         | FETCH | READ | UNPACK | MULTA      | MULTA      | MULTA        | MULTB | MULTB, ADD | ROUND | WRITE |
| CONV        | FETCH | READ | UNPACK | ADD        | ROUND      | SHIFT        | SHIFT | ADD        | ROUND | WRITE |

Contention for FU's means some pipelinings must be avoided:
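Which issue distances must be avoided can be computed from per-instruction reservation tables. The sketch below uses the ADD and MUL stage sequences as I read them from the MIPS R4000 example above, and assumes exactly one instance of every unit; the function names are made up for illustration.

```python
# Reservation tables: one set of occupied units per cycle after issue.
ADD = [{"FETCH"}, {"READ"}, {"UNPACK"}, {"SHIFT", "ADD"},
       {"ROUND", "ADD"}, {"ROUND", "SHIFT"}, {"WRITE"}]
MUL = [{"FETCH"}, {"READ"}, {"UNPACK"}, {"MULTA"}, {"MULTA"}, {"MULTA"},
       {"MULTB"}, {"MULTB", "ADD"}, {"ROUND"}, {"WRITE"}]


def conflicts(first, second, distance):
    """True if `second`, issued `distance` cycles after `first`,
    needs some unit in a cycle in which `first` still occupies it."""
    for cycle, needed in enumerate(second):
        k = cycle + distance             # the same absolute cycle, seen from `first`
        if k < len(first) and first[k] & needed:
            return True
    return False


def forbidden_distances(first, second):
    """Issue distances (in cycles, at least 1) at which the two instructions collide."""
    return [d for d in range(1, len(first)) if conflicts(first, second, d)]


print(forbidden_distances(ADD, ADD), forbidden_distances(MUL, ADD))
```

A scheduler for such a machine would treat these distances like additional latency constraints between the two instruction forms.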



## Pipelining constraints: data dependencies

RAW dependency:



Register bypassing / operand forwarding: extra HW to communicate data directly between FU's



Result of one stage is available at another stage in the next cycle.

- illustrates use of loop unrolling and introduces terminology for full SW pipelining
- but not useful in practice

$\diamond$: some binary op(s)

Scalar replacement: poor-man's alternative to alias analysis (again) but often helpful

#### **Data dependence graph of body**



#### **Data dependence graph of unrolled body – acyclic!**



same-iteration dependence
 cross-iteration dependence



#### Arrange in tableau

- rows: cycles
- columns: iterations
- unlimited resources



#### ... some more iterations.

| Cycle | 1     | 2   | 3   | 4   | 5   | 6  |
|-------|-------|-----|-----|-----|-----|----|
| 1     | acfj  | fj  | fj  | fj  | fj  | fj |
| 2     | b d   |     |     |     |     |    |
| 3     | e g h | a   |     |     |     |    |
| 4     |       | b c |     |     |     |    |
| 5     |       | d g | a   |     |     |    |
| 6     |       | e h | b   |     |     |    |
| 7     |       |     | c g | a   |     |    |
| 8     |       |     | d   | b   |     |    |
| 9     |       |     | e h | g   | a   |    |
| 10    |       |     |     | c   | b   |    |
| 11    |       |     |     | d   | g   | a  |
| 12    |       |     |     | e h |     | b  |
| 13    |       |     |     |     | c   | g  |
| 14    |       |     |     |     | d   |    |
| 15    |       |     |     |     | e h |    |

Identify groups of instructions; note gaps

Close gaps by delaying fast instruction groups (slope 3)

| Cycle | 1     | 2   | 3   | 4   | 5   | 6  |
|-------|-------|-----|-----|-----|-----|----|
| 1     | acfj  |     |     |     |     |    |
| 2     | b d   | fj  |     |     |     |    |
| 3     | e g h | a   |     |     |     |    |
| 4     |       | b c | fj  |     |     |    |
| 5     |       | d g | a   |     |     |    |
| 6     |       | e h | b   | fj  |     |    |
| 7     |       |     | c g | a   |     |    |
| 8     |       |     | d   | b   |     |    |
| 9     |       |     | e h | g   | fj  |    |
| 10    |       |     |     | c   | a   |    |
| 11    |       |     |     | d   | b   |    |
| 12    |       |     |     | e h | g   | fj |
| 13    |       |     |     |     | c   | a  |
| 14    |       |     |     |     | d   | b  |
| 15    |       |     |     |     | e h | g  |

Identify "steady state" of slope 3

| Cycle | 1     | 2   | 3   | 4   | 5   | 6  |
|-------|-------|-----|-----|-----|-----|----|
| 1     | acfj  |     |     |     |     |    |
| 2     | b d   | fj  |     |     |     |    |
| 3     | e g h | a   |     |     |     |    |
| 4     |       | b c | fj  |     |     |    |
| 5     |       | d g | a   |     |     |    |
| 6     |       | e h | b   | fj  |     |    |
| 7     |       |     | c g | a   |     |    |
| 8     |       |     | d   | b   |     |    |
| 9     |       |     | e h | g   | fj  |    |
| 10    |       |     |     | c   | a   |    |
| 11    |       |     |     | d   | b   |    |
| 12    |       |     |     | e h | g   | fj |
| 13    |       |     |     |     | c   | a  |
| 14    |       |     |     |     | d   | b  |
| 15    |       |     |     |     | e h | g  |

The rows before the steady state form the prologue; the rows after it form the epilogue.

#### **Expand instructions**

- No cycle has > 5 instructions
- Instructions in a row execute in parallel; reads in RHS happen before writes in LHS

1. Prologue – also set up i

|          |   | 1       | 2   | 3   | 4  | 5 | 6 |
|----------|---|---------|-----|-----|----|---|---|
|          | 1 | a c f j |     |     |    |   |   |
|          | 2 | b d     | fj  |     |    |   |   |
| prologue | 3 | e g h   | a   |     |    |   |   |
|          | 4 |         | b c | fj  |    |   |   |
|          | 5 |         | d g | a   |    |   |   |
|          | 6 |         | e h | b   | fj |   |   |
|          | 7 |         |     | c g | a  |   |   |

| Cycle | Instructions |  |  |  |  |
|-------|--------------|--|--|--|--|
| 1 | $a_1 \leftarrow j_0 \diamond b_0$ | $c_1 \leftarrow e_0 \diamond j_0$ | $f_1 \leftarrow U[1]$ | $j_1 \leftarrow X[1]$ | |
| 2 | $b_1 \leftarrow a_1 \diamond f_0$ | $d_1 \leftarrow f_0 \diamond c_1$ | $f_2 \leftarrow U[2]$ | $j_2 \leftarrow X[2]$ | |
| 3 | $e_1 \leftarrow b_1 \diamond d_1$ | $V[1] \leftarrow b_1$ | $W[1] \leftarrow d_1$ | $a_2 \leftarrow j_1 \diamond b_1$ | |
| 4 | $b_2 \leftarrow a_2 \diamond f_1$ | $c_2 \leftarrow e_1 \diamond j_1$ | $f_3 \leftarrow U[3]$ | $j_3 \leftarrow X[3]$ | |
| 5 | $d_2 \leftarrow f_1 \diamond c_2$ | $V[2] \leftarrow b_2$ | $a_3 \leftarrow j_2 \diamond b_2$ | | |
| 6 | $e_2 \leftarrow b_2 \diamond d_2$ | $W[2] \leftarrow d_2$ | $b_3 \leftarrow a_3 \diamond f_2$ | $f_4 \leftarrow U[4]$ | $j_4 \leftarrow X[4]$ |
| 7 | $c_3 \leftarrow e_2 \diamond j_2$ | $V[3] \leftarrow b_3$ | $a_4 \leftarrow j_3 \diamond b_3$ | | $i \leftarrow 3$ |

Original loop:

for $i \leftarrow 1$ to $N$

- $a_i \leftarrow j_{i-1} \diamond b_{i-1}$
- $b_i \leftarrow a_i \diamond f_{i-1}$
- $c_i \leftarrow e_{i-1} \diamond j_{i-1}$
- $d_i \leftarrow f_{i-1} \diamond c_i$
- $e_i \leftarrow b_i \diamond d_i$
- $f_i \leftarrow U[i]$
- $g$: $V[i] \leftarrow b_i$
- $h$: $W[i] \leftarrow d_i$
- $j_i \leftarrow X[i]$

|    | 2 | 3   | 4   | 5  | 6  |
|----|---|-----|-----|----|----|
| 8  |   | d   | b   |    |    |
| 9  |   | e h | g   | fj |    |
| 10 |   |     | c   | a  |    |
| 11 |   |     | d   | b  |    |
| 12 |   |     | e h | g  | fj |
| 13 |   |     |     | c  | a  |

2. Loop body – also increment counter and insert (modified) exit condition

Note: the index on a<sub>i</sub> is incorrect in the MCIML book.



As expected, the loop body has one copy of each instruction a-j, plus induction variable update + test

for $i \leftarrow 1$ to $N$

- $a_i \leftarrow j_{i-1} \diamond b_{i-1}$
- $b_i \leftarrow a_i \diamond f_{i-1}$
- $c_i \leftarrow e_{i-1} \diamond j_{i-1}$
- $d_i \leftarrow f_{i-1} \diamond c_i$
- $e_i \leftarrow b_i \diamond d_i$
- $f_i \leftarrow U[i]$
- $g$: $V[i] \leftarrow b_i$
- $h$: $W[i] \leftarrow d_i$
- $j_i \leftarrow X[i]$

3. Loop epilogue – finish all N iterations





Prologue:

| Cycle | Instructions |  |  |  |  |
|-------|--------------|--|--|--|--|
| 1 | $a_1 \leftarrow j_0 \diamond b_0$ | $c_1 \leftarrow e_0 \diamond j_0$ | $f_1 \leftarrow U[1]$ | $j_1 \leftarrow X[1]$ | |
| 2 | $b_1 \leftarrow a_1 \diamond f_0$ | $d_1 \leftarrow f_0 \diamond c_1$ | $f_2 \leftarrow U[2]$ | $j_2 \leftarrow X[2]$ | |
| 3 | $e_1 \leftarrow b_1 \diamond d_1$ | $V[1] \leftarrow b_1$ | $W[1] \leftarrow d_1$ | $a_2 \leftarrow j_1 \diamond b_1$ | |
| 4 | $b_2 \leftarrow a_2 \diamond f_1$ | $c_2 \leftarrow e_1 \diamond j_1$ | $f_3 \leftarrow U[3]$ | $j_3 \leftarrow X[3]$ | |
| 5 | $d_2 \leftarrow f_1 \diamond c_2$ | $V[2] \leftarrow b_2$ | $a_3 \leftarrow j_2 \diamond b_2$ | | |
| 6 | $e_2 \leftarrow b_2 \diamond d_2$ | $W[2] \leftarrow d_2$ | $b_3 \leftarrow a_3 \diamond f_2$ | $f_4 \leftarrow U[4]$ | $j_4 \leftarrow X[4]$ |
| 7 | $c_3 \leftarrow e_2 \diamond j_2$ | $V[3] \leftarrow b_3$ | $a_4 \leftarrow j_3 \diamond b_3$ | | $i \leftarrow 3$ |

Loop body (label L):

| Row | Instructions |  |  |  |  |
|-----|--------------|--|--|--|--|
| 1 | $d_i \leftarrow f_{i-1} \diamond c_i$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_i$ | | | |
| 2 | $e_i \leftarrow b_i \diamond d_i$ | $W[i] \leftarrow d_i$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| 3 | $c_{i+1} \leftarrow e_i \diamond j_i$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i + 1$ | if $i < N-2$ goto L | |

Epilogue:

| Row | Instructions |  |  |
|-----|--------------|--|--|
| 1 | $d_{N-1} \leftarrow f_{N-2} \diamond c_{N-1}$ | $b_N \leftarrow a_N \diamond f_{N-1}$ | |
| 2 | $e_{N-1} \leftarrow b_{N-1} \diamond d_{N-1}$ | $W[N-1] \leftarrow d_{N-1}$ | $V[N] \leftarrow b_N$ |
| 3 | $c_N \leftarrow e_{N-1} \diamond j_{N-1}$ | | |
| 4 | $d_N \leftarrow f_{N-1} \diamond c_N$ | | |
| 5 | $e_N \leftarrow b_N \diamond d_N$ | $W[N] \leftarrow d_N$ | |
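One way to sanity-check the prologue, loop body, and epilogue is to run them against the original loop. The following is a minimal sketch, not part of the slides: `op` is a hypothetical stand-in for ◊, the live-in values at index 0 are arbitrary, and indexed variables are modelled as dictionaries. Each row is written as one tuple assignment, so reads happen before writes. Epilogue indices follow the loop's definition $d_i \leftarrow f_{i-1} \diamond c_i$, and the exit test is assumed to read $i$ before the parallel increment.

```python
def op(x, y):
    # Hypothetical stand-in for the unspecified binary operator (the diamond).
    return (31 * x + y) % 1009

def fresh():
    """Indexed variables as dicts; index 0 holds assumed live-in values."""
    a, c, d = {}, {}, {}
    b, e, f, j = {0: 1}, {0: 2}, {0: 3}, {0: 4}
    return a, b, c, d, e, f, j

def sequential(N, U, X):
    a, b, c, d, e, f, j = fresh()
    V, W = {}, {}
    for i in range(1, N + 1):
        a[i] = op(j[i - 1], b[i - 1])
        b[i] = op(a[i], f[i - 1])
        c[i] = op(e[i - 1], j[i - 1])
        d[i] = op(f[i - 1], c[i])
        e[i] = op(b[i], d[i])
        f[i] = U[i]
        V[i] = b[i]
        W[i] = d[i]
        j[i] = X[i]
    return V, W

def pipelined(N, U, X):
    assert N >= 5, "prologue, one body pass and epilogue cover 5 iterations"
    a, b, c, d, e, f, j = fresh()
    V, W = {}, {}
    # Prologue, cycles 1..7.
    a[1], c[1], f[1], j[1] = op(j[0], b[0]), op(e[0], j[0]), U[1], X[1]
    b[1], d[1], f[2], j[2] = op(a[1], f[0]), op(f[0], c[1]), U[2], X[2]
    e[1], V[1], W[1], a[2] = op(b[1], d[1]), b[1], d[1], op(j[1], b[1])
    b[2], c[2], f[3], j[3] = op(a[2], f[1]), op(e[1], j[1]), U[3], X[3]
    d[2], V[2], a[3] = op(f[1], c[2]), b[2], op(j[2], b[2])
    e[2], W[2], b[3] = op(b[2], d[2]), d[2], op(a[3], f[2])
    f[4], j[4] = U[4], X[4]                      # still cycle 6
    c[3], V[3], a[4] = op(e[2], j[2]), b[3], op(j[3], b[3])
    i = 3
    while True:                                  # loop body, label L
        d[i], b[i + 1] = op(f[i - 1], c[i]), op(a[i + 1], f[i])
        e[i], W[i], V[i + 1] = op(b[i], d[i]), d[i], b[i + 1]
        f[i + 2], j[i + 2] = U[i + 2], X[i + 2]  # same row as the line above
        c[i + 1], a[i + 2] = op(e[i], j[i]), op(j[i + 1], b[i + 1])
        old_i, i = i, i + 1
        if not old_i < N - 2:                    # test sees i before increment
            break
    # Epilogue (at this point i == N - 1).
    d[N - 1], b[N] = op(f[N - 2], c[N - 1]), op(a[N], f[N - 1])
    e[N - 1], W[N - 1], V[N] = op(b[N - 1], d[N - 1]), d[N - 1], b[N]
    c[N] = op(e[N - 1], j[N - 1])
    d[N] = op(f[N - 1], c[N])
    e[N], W[N] = op(b[N], d[N]), d[N]
    return V, W
```

With input lists `U` and `X` indexed from 1 (index 0 unused), `pipelined(N, U, X)` and `sequential(N, U, X)` should return the same `V` and `W` for any `N >= 5`.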

Final step: eliminate indices i from variables – want "constant" variables/registers in body!

| Row | Instructions |  |  |  |  |
|-----|--------------|--|--|--|--|
| 1 | $d_i \leftarrow f_{i-1} \diamond c_i$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_i$ | | | |
| 2 | $e_i \leftarrow b_i \diamond d_i$ | $W[i] \leftarrow d_i$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j_{i+2} \leftarrow X[i+2]$ |
| 3 | $c_{i+1} \leftarrow e_i \diamond j_i$ | $a_{i+2} \leftarrow j_{i+1} \diamond b_{i+1}$ | $i \leftarrow i + 1$ | if $i < N-2$ goto L | |

Need 3 copies of j since up to 3 copies are live: $j_{i+2} \rightarrow j$, $j_{i+1} \rightarrow j'$, $j_i \rightarrow j''$


After renaming j:

| Row | Instructions |  |  |  |  |
|-----|--------------|--|--|--|--|
| 1 | $d_i \leftarrow f_{i-1} \diamond c_i$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_i$ | | | |
| 2 | $e_i \leftarrow b_i \diamond d_i$ | $W[i] \leftarrow d_i$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j \leftarrow X[i+2]$ |
| 3 | $c_{i+1} \leftarrow e_i \diamond j''$ | $a_{i+2} \leftarrow j' \diamond b_{i+1}$ | $i \leftarrow i + 1$ | if $i < N-2$ goto L | |


With the copy instructions added:

| Row | Instructions |  |  |  |  |
|-----|--------------|--|--|--|--|
| 1 | $d_i \leftarrow f_{i-1} \diamond c_i$ | $b_{i+1} \leftarrow a_{i+1} \diamond f_i$ | $j'' \leftarrow j'$ | $j' \leftarrow j$ | |
| 2 | $e_i \leftarrow b_i \diamond d_i$ | $W[i] \leftarrow d_i$ | $V[i+1] \leftarrow b_{i+1}$ | $f_{i+2} \leftarrow U[i+2]$ | $j \leftarrow X[i+2]$ |
| 3 | $c_{i+1} \leftarrow e_i \diamond j''$ | $a_{i+2} \leftarrow j' \diamond b_{i+1}$ | $i \leftarrow i + 1$ | if $i < N-2$ goto L | |

- the copies live across an iteration need to be updated in each iteration
- also, need to initialize the live-in copies of the loop at the end of the prologue (j, j')
- also, can replace the indexed live-in copies of the epilogue with primed versions
- all this for all variables a, ..., j (see book, modulo the typo regarding a, a')
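The rotation of the copies of j can be checked in isolation. The sketch below is illustrative only: the function name is hypothetical, and it assumes the prologue leaves $j = j_4$ and $j' = j_3$ when the body is first entered with $i = 3$. It replays rows 1 and 2 of the body and records what row 3 would read.

```python
def run_body_with_copies(X, N):
    """Return (i, j'', j', j) as seen by row 3 in each body iteration."""
    j = X[4]      # live-in copy j  = j_4, set up by the prologue
    j1 = X[3]     # live-in copy j' = j_3, set up by the prologue
    seen = []
    i = 3
    while i + 2 <= N:
        j2, j1 = j1, j        # row 1: j'' <- j' and j' <- j, in parallel
        j = X[i + 2]          # row 2: j <- X[i+2]
        seen.append((i, j2, j1, j))
        i += 1
    return seen
```

In every recorded tuple, `j2` should equal `X[i]`, `j1` should equal `X[i+1]`, and `j` should equal `X[i+2]`, which is what the indexed names $j_i$, $j_{i+1}$, $j_{i+2}$ denote. Note that $j''$ needs no initialization, since row 1 overwrites it before row 3 reads it.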

# Summary of main steps

1. calculate data dependence graph of unrolled loop
2. schedule each instruction from each loop iteration as early as possible
3. plot the tableau of iterations versus cycles
4. identify groups of instructions, and their slopes
5. coalesce the slopes by slowing down fast instruction groups
6. identify steady state, and loop prologue and epilogue
7. reroll the loop, removing the iteration-indexed variable names
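Steps 1 to 3 can be sketched for the running example as follows. This is a minimal illustration under stated assumptions: unit latency, unlimited resources (so every load lands in cycle 1, unlike a tableau with at most 5 instructions per cycle), and live-in values available in cycle 1. The names `USES`, `asap`, and `tableau` are hypothetical.

```python
from collections import defaultdict

# For each instruction: the values it reads, as (producer, iteration distance).
USES = {
    "a": [("j", 1), ("b", 1)],
    "b": [("a", 0), ("f", 1)],
    "c": [("e", 1), ("j", 1)],
    "d": [("f", 1), ("c", 0)],
    "e": [("b", 0), ("d", 0)],
    "f": [],
    "g": [("b", 0)],
    "h": [("d", 0)],
    "j": [],
}

def asap(uses, iterations, latency=1):
    """Earliest issue cycle of every instruction instance (ins, iteration)."""
    issue = {}

    def issue_cycle(ins, it):
        if it < 1:
            return 1 - latency            # live-in value: ready in cycle 1
        if (ins, it) not in issue:
            ready = [issue_cycle(src, it - dist) + latency
                     for src, dist in uses[ins]]
            issue[(ins, it)] = max([1] + ready)
        return issue[(ins, it)]

    for it in range(1, iterations + 1):
        for ins in uses:
            issue_cycle(ins, it)
    return issue

def tableau(issue):
    """rows[cycle][iteration] -> string of instructions issued there."""
    rows = defaultdict(lambda: defaultdict(str))
    for (ins, it), cycle in sorted(issue.items()):
        rows[cycle][it] += ins
    return rows
```

Reading the slopes off the result: the group c, d, e advances by 3 cycles per iteration, the group a, b by 2, and the loads f, j by 0, so steps 4 and 5 would slow the faster groups down to slope 3.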

#### Input:

- data dependences of loop, with latency annotations
- resource requirements of all instruction forms:





- number of available functional units (FUs) of each type, and descriptions of FU types:
  - number of instructions that can be issued in one cycle,
  - restrictions on which instruction forms can be issued simultaneously, etc.


Modulo scheduling:

- find schedule that satisfies resource and (data) dependency requirements;
   then do register allocation
- try to schedule loop body using  $\Delta$  cycles, for  $\Delta = \Delta_{min}$ ,  $\Delta_{min} + 1$ ,  $\Delta_{min} + 2$ ...
- body surrounded by prologue and epilogue as before

<u>**Observation:**</u> if resource constraints prevent an instruction from being scheduled at time **t**, they also prevent it from being scheduled at times $\mathbf{t} + \Delta$, $\mathbf{t} + 2\Delta$, ... or indeed at any **t**' with $\mathbf{t} = \mathbf{t}' \mod \Delta$.

**Example:**  $\Delta$ =3, machine can only execute 1 load instruction at a time, loop body from previous example

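[Figure: the loop body laid out over slots 0, 1, 2 (cycle mod $\Delta$); the two loads $f_i \leftarrow U[i]$ and $j_i \leftarrow X[i]$ fall into the same slot.]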




[Figure: placements of the two loads over slots 0, 1, 2, where slots wrap around modulo $\Delta$ (0 = 3, 2 = -1): the original conflict between $f_i \leftarrow U[i]$ and $j_i \leftarrow X[i]$, and variants in which the load of f is moved to a neighbouring slot and so belongs to a neighbouring iteration ($f_{i-1}$ or $f_{i+1}$).]








#### Interaction with register allocation:

- delaying an instruction d: z ← x op y
  - extends the liveness range of d's uses, namely x and y; may overlap with other (iteration-count-indexed) versions of z, so may need to maintain multiple copies, as in the previous example
  - shortens the liveness range of the def(s) of d, namely z, to its uses; a range < 1 is illegal, i.e. need to postpone the uses, too
- similarly, scheduling an instruction earlier shortens the liveness ranges of its uses and extends the liveness range of its defs
- hence, scheduling affects liveness/register allocation

# Modulo scheduling: estimating $\Delta_{min}$

#### Identification of $\Delta_{\min}$ as the maximum of the following:

- resource estimator: for each FU type
  - calculate requested cycles: add cycle requests of all instructions mapped to that FU type
  - divide request by number of instances of the FU type
  - max over all FU types is a lower bound on Δ<sub>min</sub>
- data-dependence estimator: sum of latencies along a simple cycle through the data dependence graph (take the maximum over all such cycles)


Example: 1 ALU, 1 MEM; both issue 1 instruction/cycle; instr. latency 1 cycle



- Data-dependence estimator: 3 (c $\rightarrow$ d $\rightarrow$ e $\rightarrow$ c)
- ALU estimator: 5 instrs, 1 cycle each, 1 ALU $\rightarrow$ 5
- MEM estimator: 4 instrs, 1 cycle each, 1 MEM $\rightarrow$ 4

Hence  $\Delta_{min} = 5$ 

(MEM instructions in box)
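The two estimators can be written down directly for this example. The sketch below is an illustration, not the course's implementation: the tables `FU_OF`, `FU_COUNT`, and `EDGES` are transcribed from the running loop, all names are hypothetical, and the dependence estimator assumes, as in this example, that every dependence cycle spans exactly one iteration, so it simply sums latencies.

```python
import math

FU_OF = {"a": "ALU", "b": "ALU", "c": "ALU", "d": "ALU", "e": "ALU",
         "f": "MEM", "g": "MEM", "h": "MEM", "j": "MEM"}
FU_COUNT = {"ALU": 1, "MEM": 1}

# Data dependence graph: producer -> consumers (iteration distance ignored).
EDGES = {"a": ["b"], "b": ["a", "e", "g"], "c": ["d"], "d": ["e", "h"],
         "e": ["c"], "f": ["b", "d"], "g": [], "h": [], "j": ["a", "c"]}

def resource_estimate(fu_of, fu_count, cycles_per_instr=1):
    """Max over FU types of ceil(requested cycles / number of instances)."""
    requested = {}
    for fu in fu_of.values():
        requested[fu] = requested.get(fu, 0) + cycles_per_instr
    return max(math.ceil(req / fu_count[fu]) for fu, req in requested.items())

def dependence_estimate(edges, latency=1):
    """Largest sum of latencies along any simple cycle (exhaustive search)."""
    best = 0

    def dfs(start, node, length, seen):
        nonlocal best
        for nxt in edges[node]:
            if nxt == start:
                best = max(best, length + latency)
            elif nxt not in seen:
                dfs(start, nxt, length + latency, seen | {nxt})

    for start in edges:
        dfs(start, start, 0, {start})
    return best

def delta_min(fu_of, fu_count, edges):
    return max(resource_estimate(fu_of, fu_count), dependence_estimate(edges))
```

For the tables above, the resource estimate is 5 (the ALU), the dependence estimate is 3 (the cycle through c, d, e), and `delta_min` returns 5. The exhaustive cycle search is only reasonable for small graphs like this one.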

### Modulo scheduling: priority of instructions

#### Algorithm schedules instructions according to priorities

Possible metrics:

- membership in data dependence cycle of max latency
- execution on FU type that's most heavily used (resource estimate)

Example: [c, d, e, a, b, f, j, g, h]



Main data structures:

- array SchedTime, assigning to each instruction a cycle time
- table ResourceMap, assigning to each FU and cycle time < Δ an instruction

SchedTime:

| Instruction | Cycle |
|-------------|-------|
| Instr 1     | 8     |
| Instr 2     | 4     |
| Instr 3     | 0     |
| ⋮           | ⋮     |

ResourceMap:

| Cycle (< Δ) | FU1     | FU2     |
|-------------|---------|---------|
| 0           | Instr 1 | Instr 4 |
| 1           | Instr 2 |         |
| 2           |         | Instr 3 |

- pick highest-priority instruction that's not yet scheduled: i
- schedule i at earliest cycle that
  - respects the data dependencies w.r.t. the **already scheduled instructions**
  - has the right FU for i available
  - if i can't be scheduled for current Δ, place i without respecting resource constraint: evict current inhabitant and/or data-dependence successors of i that are now scheduled too early. Evictees need to be scheduled again.
- in principle evictions could go on forever
  - define a cut-off (heuristics) at which point  $\Delta$  is increased
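The steps above can be sketched as follows. This is a simplified illustration, not a production scheduler: it assumes one instance per FU type, represents dependences as hypothetical tuples `(producer, consumer, latency, iteration distance)`, and uses a plain step budget as the cut-off heuristic. The names `sched_time` and `resource_map` mirror SchedTime and ResourceMap.

```python
def modulo_schedule(priority, fu_of, deps, delta, budget=50):
    """Try to schedule all instructions with initiation interval delta.

    Returns a dict instruction -> cycle, or None if the budget runs out.
    """
    sched_time = {}      # SchedTime: instruction -> cycle
    resource_map = {}    # ResourceMap: (FU, cycle mod delta) -> instruction

    def unschedule(ins):
        cycle = sched_time.pop(ins)
        del resource_map[(fu_of[ins], cycle % delta)]

    steps = 0
    while len(sched_time) < len(priority):
        steps += 1
        if steps > budget:
            return None                       # caller should increase delta
        ins = next(i for i in priority if i not in sched_time)
        fu = fu_of[ins]
        # Earliest cycle respecting already scheduled predecessors.
        earliest = max([0] + [sched_time[p] + lat - dist * delta
                              for p, s, lat, dist in deps
                              if s == ins and p in sched_time])
        free = [t for t in range(earliest, earliest + delta)
                if (fu, t % delta) not in resource_map]
        if free:
            cycle = free[0]
        else:                                 # no free slot: evict inhabitant
            cycle = earliest
            unschedule(resource_map[(fu, cycle % delta)])
        sched_time[ins] = cycle
        resource_map[(fu, cycle % delta)] = ins
        # Evict successors that are now scheduled too early.
        for p, s, lat, dist in deps:
            if (p == ins and s != ins and s in sched_time
                    and sched_time[s] + dist * delta < cycle + lat):
                unschedule(s)
    return sched_time

# Running example: unit latencies; distance 1 marks a use of the value
# produced in the previous iteration.
DEPS = [("j", "a", 1, 1), ("b", "a", 1, 1), ("a", "b", 1, 0), ("f", "b", 1, 1),
        ("e", "c", 1, 1), ("j", "c", 1, 1), ("f", "d", 1, 1), ("c", "d", 1, 0),
        ("b", "e", 1, 0), ("d", "e", 1, 0), ("b", "g", 1, 0), ("d", "h", 1, 0)]
FU = {"a": "ALU", "b": "ALU", "c": "ALU", "d": "ALU", "e": "ALU",
      "f": "MEM", "g": "MEM", "h": "MEM", "j": "MEM"}
PRIORITY = ["c", "d", "e", "a", "b", "f", "j", "g", "h"]
```

Calling `modulo_schedule(PRIORITY, FU, DEPS, 5)` goes through the same placements and evictions as the worked example for Δ = 5. A driver would call it with Δ = Δ<sub>min</sub>, Δ<sub>min</sub> + 1, ... until it returns a schedule.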



[c, d, e, a, b, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a |  |
| b |  |
| c |  |
| d |  |
| e |  |
| f |  |
| g |  |
| h |  |
| j |  |

- highest-priority, unscheduled instruction: c
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 0



[<s>c</s>, d, e, a, b, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a |   |
| b |   |
| c | 0 |
| d |   |
| e |   |
| f |   |
| g |   |
| h |   |
| j |   |

- so schedule c in cycle 0



[<s>c</s>, <s>d</s>, <s>e</s>, <s>a</s>, b, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b |   |
| c | 0 |
| d | 1 |
| e | 2 |
| f |   |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: d
- earliest cycle with free ALU s.t. data-deps w.r.t. scheduled instructions are respected: 1
- so schedule d in cycle 1

Similarly: $e \rightarrow 2$, $a \rightarrow 3$. Next instruction: b



For b, the earliest cycle in which the ALU is available is 4. But b's successor e is scheduled in the (earlier) cycle 2. Hence: place b in cycle 4, but evict e.




[<s>c</s>, <s>d</s>, e, <s>a</s>, <s>b</s>, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | 0 |
| d | 1 |
| e | <s>2</s> |
| f |   |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: e
- ALU-slot for e: 2 (again)
- But: data dependence e → c violated (yes, cross-iteration deps count!). So: schedule e in cycle 7 (= 2 mod Δ), but evict c.



[c, <s>d</s>, <s>e</s>, <s>a</s>, <s>b</s>, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | <s>0</s> |
| d | 1 |
| e | <s>2</s> 7 |
| f |   |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: c
- ALU-slot for c: 0 (again)
- But: data dependence $c \rightarrow d$ violated. So, schedule c in cycle 5 (= 0 mod $\Delta$), but evict d.



[<s>c</s>, d, <s>e</s>, <s>a</s>, <s>b</s>, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | <s>0</s> 5 |
| d | <s>1</s> |
| e | <s>2</s> 7 |
| f |   |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: d
- ALU-slot for d: 1 (again)
- Hooray: data dependence $d \rightarrow e$ respected. So, schedule d in cycle 6 (= 1 mod $\Delta$). No eviction.



[<s>c</s>, <s>d</s>, <s>e</s>, <s>a</s>, <s>b</s>, f, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | <s>0</s> 5 |
| d | <s>1</s> 6 |
| e | <s>2</s> 7 |
| f |   |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: f
- MEM-slot for f: 0; no data-deps, so schedule f:0



[<s>c</s>, <s>d</s>, <s>e</s>, <s>a</s>, <s>b</s>, <s>f</s>, j, g, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | <s>0</s> 5 |
| d | <s>1</s> 6 |
| e | <s>2</s> 7 |
| f | 0 |
| g |   |
| h |   |
| j |   |

- highest-priority, unscheduled instruction: j
- MEM-slot for j: 1; no data-deps, so schedule j:1
- highest-priority, unscheduled instruction: g
- MEM-slot for g: 2; the earliest cycle of the form 2 + k·Δ where data-dep b → g is respected is 7. So schedule g:7.



[<s>c</s>, <s>d</s>, <s>e</s>, <s>a</s>, <s>b</s>, <s>f</s>, <s>j</s>, <s>g</s>, h]

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | <s>0</s> 5 |
| d | <s>1</s> 6 |
| e | <s>2</s> 7 |
| f | 0 |
| g | 7 |
| h |   |
| j | 1 |

- highest-priority, unscheduled instruction: h
- MEM-slot for h: 3; the earliest cycle of the form 3 + k·Δ where data-dep d → h is respected is 8. So schedule h:8.



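[<s>c</s>, <s>d</s>, <s>e</s>, <s>a</s>, <s>b</s>, <s>f</s>, <s>j</s>, <s>g</s>, <s>h</s>]

Final SchedTime, as determined by the preceding steps:

| Instruction | SchedTime |
|-------------|-----------|
| a | 3 |
| b | 4 |
| c | 5 |
| d | 6 |
| e | 7 |
| f | 0 |
| g | 7 |
| h | 8 |
| j | 1 |

Instructions c, d, e, g, h are scheduled 1 iteration off.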




# Summary of scheduling

Challenges arise from interaction between

- program properties: data dependencies (RAW, WAR, WAW) and control dependencies
- hardware constraints (FU availability, latencies, ...)

Optimal solutions typically infeasible  $\rightarrow$  <u>heuristics</u>

Scheduling within a basic block (local): list scheduling





Scheduling across basic blocks (global): trace scheduling

Loop scheduling: SW pipelining, modulo scheduling

