# Topic 10: Pipelining

### COS / ELE 375

# Computer Architecture and Organization

Princeton University Fall 2015

Prof. David August

# Pipelining is Natural: Assembly Line!

#### Laundry Example

- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 30 minutes
- "Folder" takes 30 minutes
- "Stasher" takes 30 minutes to put clothes into drawers











# Sequential Laundry



Sequential laundry takes 8 hours for 4 loads.

If they learned pipelining, how long would laundry take?



Pipelined laundry takes 3.5 hours for 4 loads!
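The two totals can be checked with a short calculation. This is a minimal sketch that assumes all four stages (wash, dry, fold, stash) take the same 30 minutes; the function names are made up for illustration.

```python
def sequential_minutes(loads, stages, stage_minutes):
    # Each load runs through every stage before the next load starts.
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    # The first load needs `stages` steps; every later load finishes
    # one step after the load in front of it.
    return (stages + loads - 1) * stage_minutes


print(sequential_minutes(4, 4, 30) / 60)  # 8.0 hours
print(pipelined_minutes(4, 4, 30) / 60)   # 3.5 hours
```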

# **Slow Dryers**



5.5 Hours. What is going on here?

# **Pipelining Lessons**



1. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
2. Multiple tasks operate simultaneously using different resources
3. Potential speedup = number of pipe stages
4. Pipeline rate is limited by the slowest pipeline stage
5. Unbalanced lengths of pipe stages reduce speedup
6. Time to "fill" the pipeline and time to "drain" it reduce speedup
7. Stall for dependences
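The effect of one slow stage can be sketched with a small schedule simulation. Two assumptions are not from the slides: the dryer takes 60 minutes (a value chosen because it reproduces the 5.5-hour total), and a finished load can wait in a basket between stages. `pipeline_finish_time` is a hypothetical helper.

```python
def pipeline_finish_time(num_tasks, stage_times):
    # free_at[s] is the time at which stage s finishes its latest task.
    free_at = [0] * len(stage_times)
    for _ in range(num_tasks):
        ready = 0  # time at which this task may enter the next stage
        for s, duration in enumerate(stage_times):
            start = max(ready, free_at[s])  # wait for the task and the stage
            free_at[s] = start + duration
            ready = free_at[s]
    return free_at[-1]


balanced = pipeline_finish_time(4, [30, 30, 30, 30])
slow_dryer = pipeline_finish_time(4, [30, 60, 30, 30])  # assumed 60-minute dryer
print(balanced / 60, slow_dryer / 60)  # 3.5 5.5
```

After the pipeline fills, a load completes once every 60 minutes instead of every 30: the slowest stage sets the rate.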



### **MIPS**

# Pipe Stages == The Five Execution Steps

1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-Back Step
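A quick way to see the five steps overlap is to print which stage each instruction occupies in each cycle. This is a sketch for an ideal pipeline with no hazards; the abbreviations IF, ID, EX, MEM, WB and the `pipeline_chart` helper are assumptions for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions):
    # Instruction i occupies stage s during cycle i + s (no stalls).
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_chart(3)):
    print(f"inst {i}: " + " ".join(f"{cell:>3}" for cell in row))
```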



# Pipelining in MIPS






# Can We Pipeline the Unicycle Datapath?



# Unicycle



How do we split the datapath into stages?



# Basic Idea



# Slicing of Datapath

# Rectangles are pipeline registers



# Slicing of Datapath

# Anything wrong in this picture?



# **Corrected Datapath**



Other(?) Control Signals?

#### **Another View:**



# Performance?

# (Is it worth the pain?)

#### Unicycle Machine

45 ns/cycle x 1 CPI x 100 inst = 4500 ns



#### Multicycle Machine

10 ns/cycle x 4.6 CPI (inst mix) x 100 inst = 4600 ns



#### Ideal Pipelined Machine (5 Pipeline Stages)

10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
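The three totals come from the same formula, time = cycle time x (CPI x instructions + extra cycles). A minimal sketch using the numbers above; `execution_time_ns` is a hypothetical helper name.

```python
def execution_time_ns(cycle_ns, cpi, instructions, extra_cycles=0):
    # extra_cycles covers pipeline drain; it is zero for non-pipelined machines.
    return cycle_ns * (cpi * instructions + extra_cycles)


unicycle = execution_time_ns(45, 1, 100)
multicycle = execution_time_ns(10, 4.6, 100)
pipelined = execution_time_ns(10, 1, 100, extra_cycles=4)

print(f"unicycle   {unicycle:.0f} ns")
print(f"multicycle {multicycle:.0f} ns")
print(f"pipelined  {pipelined:.0f} ns")
print(f"speedup over unicycle: {unicycle / pipelined:.2f}x")
```

The speedup is a little over 4x rather than 5x because the pipelined clock is set by the 10 ns stage and four cycles are spent on fill and drain.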





- One operation must complete before next can begin
- Operations spaced 33ns apart

# 3 Stage Pipeline Implementation Detail



# **Limitation 1: Nonuniform Pipelining**



- Throughput limited by slowest stage
- Delay determined by clock period × number of stages
- Must attempt to balance stages


- Diminishing returns as we add more pipeline stages
- Register delays become limiting factor
  - Increased latency
  - Small throughput gains
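The diminishing returns can be sketched with a simple model: clock period = logic delay / stages + register delay. The 30 ns of logic and 3 ns of register delay are assumptions, chosen only so that the unpipelined case matches the 33 ns spacing mentioned earlier; the model also assumes the logic splits evenly.

```python
def clock_period_ns(logic_ns, register_ns, stages):
    # Each stage holds an equal share of the logic plus one pipeline register.
    return logic_ns / stages + register_ns


def latency_ns(logic_ns, register_ns, stages):
    # One operation passes through every stage, one clock period each.
    return stages * clock_period_ns(logic_ns, register_ns, stages)


for stages in (1, 3, 6, 30):
    period = clock_period_ns(30, 3, stages)
    print(stages, "stages:", period, "ns/op,", latency_ns(30, 3, stages), "ns latency")
```

Under these assumptions the clock period can never drop below the 3 ns register delay, while latency keeps growing with every added stage.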

Unfortunately, there are other complications...




## Pipeline Hazards

Next instruction cannot immediately follow previous instruction in the presence of a hazard.

Three types: Structural, Control, Data

#### Structural Hazards

- Resource oversubscription
- Suppose we had only one memory
- In laundry, think of a washer/dryer combo unit

# Pipeline Hazards

#### Control Hazards

- What is the next instruction?
- Branch instructions take time to compute this.

#### Solution 1: Stall



### Pipeline Hazards

#### **Control Hazards**

- What is the next instruction?
- Branch instructions take time to compute this.

#### Solution 2: Predict the Branch Target



### Pipeline Hazards

#### Control Hazards

- What is the next instruction?
- Branch instructions take time to compute this.

#### Solution 2: (Mis)Predict the Branch Target



### Pipeline Hazards

#### **Control Hazards**

- What is the next instruction?
- Branch instructions take time to compute this.

#### Solution 3: Delayed Decision (Used in MIPS)



More about Branch Prediction/Delayed Branching Later...

### Pipeline Hazards

#### **Data Hazards**

Value from prior instruction is needed before write back

#### Typical Instruction (new representation):



#### Pipeline Hazards

#### **Data Hazards**

Value from prior instruction is needed before write back

#### **Data Hazard:**



# Pipeline Hazards

#### **Data Hazards**

Value from prior instruction is needed before write back

#### Load-Use Data Hazard: Options: Delayed Load or Bubble



# Summary and Real Stuff

#### **Summary**

- Pipelining is a fundamental concept in computers/nature
  - Multiple instructions in flight
  - Limited by length of longest stage, Latency vs. Throughput
- Hazards gum up the works

#### **Real Stuff**

- MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
- More performance from deeper pipelines, parallelism to a point
- Pentium 4 has 22 pipe stages!



# Review: Pipelined Datapath



Note that all R-Type Instructions have a NULL stage!

# Review: Pipeline Hazards

#### Structural Hazards

Resource oversubscription:



# Review: Pipeline Hazards

#### Control Hazards

- What is the next instruction?
- Branch instructions take time to compute this.

#### Stall, Predict, or Delay:



Pipeline Stall - only 1 cycle/stage delay...

### Review: Pipeline Hazards

#### Control Hazards

- What is the next instruction?
- Branch instructions take time to compute this.

#### Delayed Decision (Used in MIPS):



More about Branch Prediction/Delayed Branching Later...

#### Review: Pipeline Hazards

#### **Data Hazards**

Value from prior instruction is needed before write back

#### **Data Hazard:**



#### Review: Pipeline Hazards

#### **Data Hazards**

Value from prior instruction is needed before write back

#### Load-Use Data Hazard: Options: Delayed Load or Bubble








# **Pipeline Control**

- Control is divided into 5 stages
- Signal values same as unicycle case!
- Timing is different...




# **Pipeline Control**

- Signal values same as unicycle case!
- Timing is different...
- Simplest method: Extend pipe registers



# **Pipeline Control**



### What About Data Hazards?






# **Forwarding Unit**



How does the Forwarding Unit know when to forward?

# **Forwarding Unit**



#### EX Hazard:

EX/MEM.RegWrite AND EX/MEM.RegisterRd != 0 AND EX/MEM.RegisterRd == ID/EX.RegisterReadRs(Rt)

The MEM hazard test is very similar, but when both match, prefer the MEM value over the WB value
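The forwarding decision for one ALU operand can be sketched as a priority check. `PipeReg` and `forward_source` are hypothetical names, and the returned strings are just labels for the mux choice; call the function once with Rs and once with Rt.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    reg_write: bool   # RegWrite control bit carried in the pipeline register
    register_rd: int  # destination register number


def forward_source(src_reg, ex_mem, mem_wb):
    def matches(stage):
        # Register 0 is never forwarded: it always reads as zero in MIPS.
        return stage.reg_write and stage.register_rd != 0 and stage.register_rd == src_reg

    if matches(ex_mem):   # EX hazard: newest value wins
        return "EX/MEM"
    if matches(mem_wb):   # MEM hazard: older value, used only if EX/MEM does not match
        return "MEM/WB"
    return "REGFILE"      # no hazard: use the value read in decode


# Both in-flight instructions write $1; the most recent result is chosen.
print(forward_source(1, PipeReg(True, 1), PipeReg(True, 1)))  # EX/MEM
```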

#### What About Load-Use Stall?

- Forwarding can't save the day
- Need to introduce stall in hardware or compiler



#### What About Load-Use Stall?



#### **Hazard Detection Unit**



How does the Hazard Detection Unit know when to stall?

#### **Hazard Detection Unit**



ID/EX.MemRead AND
(ID/EX.RegisterRt == IF/ID.RegisterRs OR
ID/EX.RegisterRt == IF/ID.RegisterRt)
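The load-use condition translates directly into a boolean function. A minimal sketch; `load_use_stall` and its parameter names are illustrative, mirroring the pipeline register fields in the condition.

```python
def load_use_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    # Stall when the instruction in EX is a load whose destination (Rt)
    # is a source register of the instruction currently in decode.
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# lw $2, 0($1) followed by add $4, $2, $5: the add must wait one cycle.
print(load_use_stall(True, 2, 2, 5))   # True
# The same register numbers after a non-load: forwarding is enough.
print(load_use_stall(False, 2, 2, 5))  # False
```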

# What About Control Hazards?

(Predict Not-Taken Machine)



We are OK, as long as we squash. Can we reduce delay?

**Reduce Branch Delay**

1. Move branch address calculation to decode stage (from MEM stage)
2. Move branch decision up (harder)
   - Bitwise-XOR, test for zero
   - Only need equality testing
   - Much faster: no carry

Everything is done in decode stage!!
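The XOR-based equality test can be sketched in a few lines. It models 32-bit registers as Python integers; `registers_equal` is a hypothetical name.

```python
def registers_equal(a, b, width=32):
    # XOR leaves a 1 wherever the two values differ, so the operands are
    # equal exactly when every result bit is 0. Each bit is independent,
    # which is why no carry chain is needed.
    mask = (1 << width) - 1
    return ((a ^ b) & mask) == 0


print(registers_equal(0x1234, 0x1234))  # True  -> beq taken
print(registers_equal(0x1234, 0x1235))  # False -> beq falls through
```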

# What About Control Hazards?








# **Review: Exceptions**

- What happens if instruction encoding is not valid?
- What about arithmetic overflow?

#### Exception

An event that disrupts program execution.

#### When an exception occurs:

- Save the current PC in the EPC
- Cause = 0 for Undefined Instruction, 1 for Overflow
- Jump to the OS at C0000000<sub>16</sub> (not vectored)
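The three steps can be sketched as a state update. `CpuState` and `take_exception` are hypothetical names; the handler address and cause codes are the ones listed above.

```python
from dataclasses import dataclass

HANDLER_ADDRESS = 0xC0000000        # single OS entry point (not vectored)
CAUSE_UNDEFINED_INSTRUCTION = 0
CAUSE_OVERFLOW = 1


@dataclass
class CpuState:
    pc: int
    epc: int = 0
    cause: int = 0


def take_exception(state, cause):
    state.epc = state.pc        # save the current PC in the EPC
    state.cause = cause         # record why the exception happened
    state.pc = HANDLER_ADDRESS  # jump to the OS


cpu = CpuState(pc=0x00400020)
take_exception(cpu, CAUSE_OVERFLOW)
print(hex(cpu.epc), cpu.cause, hex(cpu.pc))  # 0x400020 1 0xc0000000
```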

# Review: Multicycle Exception Handling



# **Exceptions in Pipelines**

- Exception must appear to programmer/OS as it would in unicycle/multicycle
- Must squash in-flight instructions after excepting inst
- Looks a lot like a branch...




# Pipeline Exception Handling



### Look at this mess!!!



# Precise vs. Imprecise Exceptions

#### **Precise Exceptions**

- EPC has value of excepting instruction PC
- Easy for OS to handle
- We have been looking at precise exception machine

### **Imprecise Exceptions**

- Reduce pipeline complexity by putting current PC or other approximation into EPC
- OS figures it out

# Summary

- Pipelining is a fundamental concept in computers/nature
  - Multiple instructions in flight
  - Limited by length of longest stage, Latency vs. Throughput
- Hazards gum up the works
- Pipeline Control can be messy!