# Scalable Multiprocessors

Topics

- Supporting programming models
- Network interface
- Interconnection network

#### Limited Scaling of a Bus

| Characteristic            | Bus                |
|---------------------------|--------------------|
| Physical length           | ~ 1 ft             |
| Number of connections     | fixed              |
| Maximum bandwidth         | fixed              |
| Interface to comm. medium | memory             |
| Global order              | arbitration        |
| Protection                | virtual ⇒ physical |
| Trust                     | total              |
| OS                        | single             |
| Comm. abstraction         | HW                 |

- Scaling limit
- Close coupling among components

#### Comparing with a LAN

| Characteristic            | Bus                | LAN         |
|---------------------------|--------------------|-------------|
| Physical length           | ~ 1 ft             | km          |
| Number of connections     | fixed              | many        |
| Maximum bandwidth         | fixed              | ???         |
| Interface to comm. medium | memory             | peripheral  |
| Global order              | arbitration        | ???         |
| Protection                | virtual ⇒ physical | OS          |
| Trust                     | total              | little      |
| OS                        | single             | independent |
| Comm. abstraction         | HW                 | SW          |

- No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve


#### Scalable Computers

- What are the design trade-offs across the spectrum of machines?
  - Specialized or commodity nodes?
  - Capability of node-to-network interface
  - Supporting programming models?
- What does scalability mean?
  - Avoid inherent design limits on resources
  - Bandwidth increases with n
  - Latency does not increase with n
  - Cost increases slowly with n

# Programming Models Realized by Protocols



## **Bandwidth Scalability**



- A network transaction is a one-way transfer of information from a source output buffer to a destination input buffer
  - causes some action at the destination
  - occurrence is not directly visible at the source
- Possible actions: deposit data, state change, reply


#### Shared Address Space Abstraction



- Source and destination data addresses are specified by the source of the request
  - a degree of logical coupling and trust
- no storage logically "outside the application address space(s)"
  - But it may employ temporary buffers for transport
- Operations are fundamentally request / response
- Remote operation can be performed on remote memory
  - logically does not require intervention of the remote processor

#### The Fetch Deadlock Problem

#### Message Passing

- Bulk transfers
- Complex synchronization semantics
  - more complex protocols
  - more complex action
- Synchronous
  - Send completes after matching recv and source data sent
  - Receive completes after data transfer complete from matching send
- Asynchronous
  - Send completes after send buffer may be reused

#### Synchronous Message Passing



- Constrained programming model.
- Deterministic! What happens when threads are added?
- Destination contention very limited.

#### Asynchronous Message Passing: Conservative



- Where is the buffering?
- Contention control? Receiver initiated protocol?
- Short message optimizations

# Asynchronous Message Passing: Optimistic



- More powerful programming model
- Wildcard receive => non-deterministic
- Storage required within msg layer?

# Key Features of Msg Passing Abstraction

- Source knows send data address, destination knows receive data address
  - after handshake they both know
- Arbitrary storage "outside the local address spaces"
  - may post many sends before any receives
  - non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
    - fine print says these are limited too
- Fundamentally a 3-phase transaction
  - includes a request / response
  - can use optimistic 1-phase in limited "Safe" cases
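The three-phase transaction above (request, response, data transfer) can be sketched as a toy in-process model. This is only an illustration under stated assumptions: `Receiver`, `post_recv`, `request_to_send`, `deliver`, and `send` are hypothetical names, not part of any real message-passing library, and the "network" is just a function call.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy destination node (hypothetical): posted receives keyed by tag."""
    posted: dict = field(default_factory=dict)   # tag -> destination buffer
    trace: list = field(default_factory=list)    # record of protocol phases

    def post_recv(self, tag, buffer):
        # The destination now knows the receive data address for this tag.
        self.posted[tag] = buffer

    def request_to_send(self, tag):
        # Phase 1 arrives: the source announces a send.
        self.trace.append(("request", tag))
        # Phase 2: reply "ready" only if a matching receive is posted.
        ready = tag in self.posted
        self.trace.append(("reply", tag, ready))
        return ready

    def deliver(self, tag, data):
        # Phase 3: data lands directly in the posted receive buffer.
        self.trace.append(("data", tag))
        self.posted.pop(tag).extend(data)


def send(receiver, tag, data):
    """Return True if the transfer completed, False if no receive matched."""
    if not receiver.request_to_send(tag):    # phases 1 and 2
        return False                         # caller must retry later
    receiver.deliver(tag, data)              # phase 3
    return True


rx = Receiver()
buf = []
early = send(rx, tag=7, data=[1, 2, 3])      # no receive posted yet
rx.post_recv(7, buf)
late = send(rx, tag=7, data=[1, 2, 3])       # handshake succeeds
```

In this sketch the failed first attempt needs no storage at the destination, which is the point of the conservative handshake; an optimistic one-phase send would have to buffer the data somewhere.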


#### **Network Interface**

- Transfer between local memory and NIC buffers
  - SW translates VA ⇔ PA
  - SW initiates DMA
  - SW does buffer management
  - NIC initiates interrupts on receive
  - Provides protection
- Transfer between NIC buffers and the network
  - Generate packets
  - Flow control with the network



#### **Network Performance Metrics**



Includes header/trailer in BW calculation?

#### Protected User-Level Communication

- Traditional NIC (e.g. Ethernet) requires OS kernel to initiate DMA and to manage buffers
  - Prevent apps from crashing OS or other apps
  - Overhead is high (how high?)
- Multicomputer or multiprocessor NICs
  - OS maps VA to PA buffers
  - Apps initiate DMAs using VA addresses or handles of descriptors
  - NIC uses mapped PA buffers to perform DMAs
- Examples
  - Research: Active Messages, UDMA
  - Industry: VIA and RDMA

#### User Level Network ports



- Appears to user as logical message queues plus status
- What happens if the user never pops the queue?


#### **User Level Abstraction**



- Any user process can post a transaction for any other in protection domain
  - communication layer moves OQ<sub>src</sub> -> IQ<sub>dest</sub>
  - may involve indirection: VAS<sub>src</sub> -> VAS<sub>dest</sub>

#### Scalable Interconnection Network

- At core of parallel computer architecture
- Requirements and trade-offs at many levels
  - Elegant mathematical structure
  - Deep relationships to algorithm structure
  - Managing many traffic flows
  - Electrical / optical link properties
- Little consensus
  - interactions across levels
  - Performance metrics
  - Cost metrics
  - Workload
- Need holistic understanding



# Generic Multiprocessor Architecture



Network characteristics

- Network bandwidth: on-chip and off-chip interconnection network
- Bandwidth demands: independent and communicating threads/processes
- Latency: local and remote

Requirements from Above

- Communication-to-computation ratio
  - $\Rightarrow$  bandwidth that must be sustained for given computational rate
  - Traffic localized or dispersed?
  - Bursty or uniform?
- Programming Model
  - Protocol
  - Granularity of transfer
  - Degree of overlap (slackness)
- The job of a parallel machine's interconnection network is to transfer information from source node to destination node in support of network transactions that realize the programming model


#### Characteristics of A Network

- Topology (what)
  - Physical interconnection structure of the network graph
  - Direct: node connected to every switch
  - Indirect: nodes connected to specific subset of switches
- Routing Algorithm (which)
  - Restricts the set of paths that messages may follow
  - Many algorithms with different properties
- Switching Strategy (how)
  - How data in a message traverses a route
  - Store and forward vs. cut through
- Flow Control Mechanism (when)
  - When a message or portions of it traverse a route
  - What happens when traffic is encountered?


#### **Network Basics**



- Link made of some physical media
  - wire, fiber, air
- with a transmitter (tx) on one end
  - converts digital symbols to analog signals and drives them down the link
- and a receiver (rx) on the other
  - captures analog signals and converts them back to digital signals
- tx+rx called a transceiver

# **Basic Definitions**

- Network interface
  - Communication between a node and the network
- Links
  - Bundle of wires or fibers that carries signals
- Switches
  - Connects fixed number of input channels to fixed number of output channels


# Traditional Network Media



## **Emerging Media**

- Proximity project (Sun Microsystems)
  - Potentially deliver TB/sec between chips
  - Microscopic metal pads coated with a micron-thin layer of insulator to protect the chip from static electricity
  - Two chips contact each other




#### Networks in Parallel Machines

#### Some old machines

| Machine       | Topology  | Cycle Time<br>(ns) | Channel<br>Width<br>(bits) | Routing<br>Delay<br>(cycles) | Flit<br>(data bits) |
|---------------|-----------|--------------------|----------------------------|------------------------------|---------------------|
| nCUBE/2       | Hypercube | 25                 | 1                          | 40                           | 32                  |
| TMC CM-5      | Fat-Tree  | 25                 | 4                          | 10                           | 4                   |
| IBM SP-2      | Banyan    | 25                 | 8                          | 5                            | 16                  |
| Intel Paragon | 2D Mesh   | 11.5               | 16                         | 2                            | 16                  |
| Meiko CS-2    | Fat-Tree  | 20                 | 8                          | 7                            | 8                   |
| CRAY T3D      | 3D Torus  | 6.67               | 16                         | 2                            | 16                  |
| DASH          | Torus     | 30                 | 16                         | 2                            | 16                  |
| J-Machine     | 3D Mesh   | 31                 | 8                          | 2                            | 8                   |
| Monsoon       | Butterfly | 20                 | 16                         | 2                            | 16                  |
| SGI Origin    | Hypercube | 2.5                | 20                         | 16                           | 160                 |
| Myricom       | Arbitrary | 6.25               | 16                         | 50                           | 16                  |

#### New machines

- Cray XT3 and XT4: 3D torus, 7 GB/s per link
- IBM Blue Gene/L: 3D torus, 1.4 Gb/s per link


#### Linear Arrays and Rings



Figure panels: linear array; torus; torus arranged to use short wires.

- Linear Array
  - Diameter?
  - Average Distance?
  - Bisection bandwidth?
  - Route A -> B given by relative address R = B-A
- Torus?
- Examples: FDDI, SCI, FiberChannel Arbitrated Loop, KSR1
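The diameter and average-distance questions above can be checked by brute force. The following is a small sketch, not taken from the slides: the function names are illustrative, and the average is taken over all ordered pairs of distinct nodes.

```python
from itertools import product


def linear_distance(a, b, n):
    """Hops between nodes a and b on an n-node linear array: |R| with R = b - a."""
    return abs(b - a)


def ring_distance(a, b, n):
    """Hops on an n-node bidirectional ring: take the shorter way around."""
    r = (b - a) % n
    return min(r, n - r)


def diameter_and_average(distance, n):
    """Brute-force diameter and mean distance over ordered pairs with a != b."""
    hops = [distance(a, b, n)
            for a, b in product(range(n), repeat=2) if a != b]
    return max(hops), sum(hops) / len(hops)


n = 16
array_diam, array_avg = diameter_and_average(linear_distance, n)
ring_diam, ring_avg = diameter_and_average(ring_distance, n)
```

For n = 16 this gives a diameter of n - 1 and an average near n/3 for the array, and a diameter of n/2 and an average near n/4 for the ring. Cutting the array in half severs one link; cutting the ring severs two.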

#### Multidimensional Meshes and Tori







3D Cube

- d-dimensional array
  - $N = k_{d-1} \times \dots \times k_0$ nodes
  - described by *d*-vector of coordinates (i<sub>d-1</sub>, ..., i<sub>0</sub>)
- d-dimensional k-ary mesh: N = k<sup>d</sup>
  - k = <sup>d</sup>√N
  - described by *d*-vector of radix-k coordinates
- d-dimensional k-ary torus (or k-ary d-cube)?
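As an illustration of the coordinate description, the sketch below converts between a flat node number and its *d*-vector of radix-k coordinates. The helper names `to_coords` and `from_coords` are assumptions made for this example, as is the convention that node numbers run from 0 to k<sup>d</sup> - 1.

```python
def to_coords(node, k, d):
    """Node id -> d-vector of radix-k digits (i_{d-1}, ..., i_0)."""
    digits = []
    for _ in range(d):
        node, digit = divmod(node, k)   # peel off the least significant digit
        digits.append(digit)
    return tuple(reversed(digits))      # most significant coordinate first


def from_coords(coords, k):
    """Inverse of to_coords: fold the radix-k digits back into a node id."""
    node = 0
    for digit in coords:
        node = node * k + digit
    return node


k, d = 4, 3                             # a 4-ary 3-dimensional array, N = 64
coords = to_coords(27, k, d)            # 27 = 1*16 + 2*4 + 3
node = from_coords(coords, k)
```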

#### Properties

- Routing
  - relative distance: $R = (b_{d-1} - a_{d-1}, \dots, b_0 - a_0)$
  - traverse $r_i = b_i - a_i$ hops in each dimension
  - dimension-order routing
- Average distance
  - $d \times 2k/3$ for mesh
  - $dk/2$ for cube
- Wire length?
- Degree?
- Bisection bandwidth? Partitioning?
  - $k^{d-1}$ bidirectional links
- Physical layout?
  - 2D in O(N) space, short wires
  - higher dimension?
- Embed multiple logical dimensions in one physical dimension using long wires (example: 6 x 3 x 2)

# Hypercubes

- Also called binary n-cubes. # of nodes = N = 2<sup>n</sup>.
- O(logN) Hops
- Good bisection BW
- Complexity
  - Out degree is n = logN
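A short sketch of why a route needs at most n = log N hops: neighbors differ in exactly one address bit, so a route only has to flip the bits in which source and destination differ. The name `ecube_route` and the choice to resolve dimensions from lowest to highest are assumptions made for this example.

```python
def ecube_route(src, dst, n):
    """Nodes visited from src to dst in an n-dimensional hypercube,
    correcting differing address bits from dimension 0 upward."""
    path = [src]
    node = src
    for dim in range(n):
        bit = 1 << dim
        if (node ^ dst) & bit:      # addresses differ in this dimension
            node ^= bit             # cross the link in that dimension
            path.append(node)
    return path


route = ecube_route(0b0101, 0b1110, n=4)
hops = len(route) - 1               # equals the Hamming distance of the addresses
```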



- with random comm., 2 ports per processor





# Multistage Network



**Embeddings in Two Dimensions** 

- Routing from left to right
- Typically n = log(p)


#### Trees



- Diameter and average distance are logarithmic
  - k-ary tree, height d = log<sub>k</sub> N
  - address specified as a d-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- Route up to common ancestor and down
  - R = B xor A
  - let i be position of most significant 1 in R, route up i+1 levels
  - down in direction given by low i+1 bits of B
- H-tree space is O(N) with O( $\sqrt{N}$ ) long wires
- Bisection bandwidth?
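The up-then-down routing rule above can be sketched for the simplest case. This assumes a complete binary tree (k = 2) whose leaves hold the processors and are numbered left to right; `tree_route` is an illustrative name, not something from the slides.

```python
def tree_route(a, b):
    """Route between leaves a and b of a complete binary tree.

    Returns (levels_up, turns_down), where turns_down lists the child
    (0 = left, 1 = right) taken at each level on the way down.
    """
    r = a ^ b                               # R = B xor A
    if r == 0:
        return 0, []                        # same leaf, nothing to do
    i = r.bit_length() - 1                  # position of most significant 1 in R
    levels_up = i + 1                       # climb to the common ancestor
    # Descend following the low i+1 bits of B, most significant first.
    turns_down = [(b >> bit) & 1 for bit in range(i, -1, -1)]
    return levels_up, turns_down


up, down = tree_route(0b0101, 0b0110)       # R = 0b0011, so i = 1
```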

#### **Topology Summary**

| Topology     | Degree    | Diameter                 | Ave Dist             | Bisection         | D (D ave) @ P=1024 |
|--------------|-----------|--------------------------|----------------------|-------------------|--------------------|
| 1D Array     | 2         | N-1                      | N/3                  | 1                 | huge               |
| 1D Ring      | 2         | N/2                      | N/4                  | 2                 |                    |
| 2D Mesh      | 4         | 2 (N<sup>1/2</sup> - 1)  | 2/3 N<sup>1/2</sup>  | N<sup>1/2</sup>   | 62 (21)            |
| 2D Torus     | 4         | N<sup>1/2</sup>          | 1/2 N<sup>1/2</sup>  | 2N<sup>1/2</sup>  | 32 (16)            |
| k-ary n-cube | 2n        | nk/2                     | nk/4                 | nk/4              | 15 (7.5) @ n=3     |
| Hypercube    | n = log N | n                        | n/2                  | N/2               | 10 (5)             |

- All have some "bad permutations"
  - many popular permutations are very bad for meshes (transpose)
  - randomness in wiring or routing makes it hard to find a bad one!
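To make the P = 1024 comparison concrete, the sketch below evaluates the closed-form diameter and average-distance expressions from the summary table for the 2D mesh, 2D torus, and hypercube. The function name `summary` is illustrative, and the code assumes the node count is both a perfect square and a power of two.

```python
import math


def summary(n_nodes):
    """(diameter, average distance) per topology from the closed forms."""
    side = math.isqrt(n_nodes)          # nodes per dimension in the 2D cases
    log_n = n_nodes.bit_length() - 1    # hypercube dimension, log2 of n_nodes
    return {
        "2D mesh": (2 * (side - 1), 2 * side / 3),
        "2D torus": (side, side / 2),
        "hypercube": (log_n, log_n / 2),
    }


rows = summary(1024)
for name, (diameter, average) in rows.items():
    print(f"{name:10s} diameter={diameter:3d} average={average:.1f}")
```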

#### **Fat-Trees**



#### How Many Dimensions?

- n = 2 or n = 3
  - Short wires, easy to build
  - Many hops, low bisection bandwidth
  - Requires traffic locality
- n >= 4
  - Harder to build, more wires, longer average length
  - Fewer hops, better bisection bandwidth
  - Can handle non-local traffic
- k-ary d-cubes provide a consistent framework for comparison
  - N = k<sup>d</sup>
  - scale dimension (d) or nodes per dimension (k)
  - assume cut-through


#### **Routing Mechanism**

- Need to select output port for each input packet
  - in a few cycles
- Simple arithmetic in regular topologies
  - ex: Dx, Dy routing in a grid
    - west (-x): Dx < 0
    - east (+x): Dx > 0
    - south (-y): Dx = 0, Dy < 0
    - north (+y): Dx = 0, Dy > 0
    - processor: Dx = 0, Dy = 0
- Reduce relative address of each dimension in order
  - Dimension-order routing in k-ary d-cubes
  - e-cube routing in n-cube
- Source-based
  - message header carries series of port selects
  - used and stripped en route
  - CRC? Packet Format?
- Table-driven
  - message header carries index for next port at next switch
    - $o = R[i]$
  - table also gives index for following hop
    - $o, i' = R[i]$
  - ATM, HPPI
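The Dx, Dy port-selection rules for a grid can be sketched as follows. The function names `next_port` and `route` and the `STEP` table are assumptions made for illustration; the relative address is simply recomputed at each hop, which has the same effect as a header that is decremented en route.

```python
def next_port(dx, dy):
    """Output port chosen by a switch from the remaining relative address."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:              # only reached once dx == 0
        return "south"
    if dy > 0:
        return "north"
    return "processor"      # dx == 0 and dy == 0: deliver locally


# Coordinate change produced by leaving through each port.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}


def route(src, dst):
    """Sequence of output ports taken from src to dst (x first, then y)."""
    x, y = src
    ports = []
    while True:
        port = next_port(dst[0] - x, dst[1] - y)
        ports.append(port)
        if port == "processor":
            return ports
        step_x, step_y = STEP[port]
        x, y = x + step_x, y + step_y


ports = route((0, 0), (2, -1))
```

Because x is always exhausted before y, every packet between a given pair of nodes follows the same path, which makes this a deterministic, dimension-order scheme.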

#### Properties of Routing Algorithms

- Deterministic
  - route determined by (source, dest), not intermediate state (i.e. traffic)
- Adaptive
  - route influenced by traffic along the way
- Minimal
  - only selects shortest paths
- Deadlock free
  - no traffic pattern can lead to a situation where no packets move forward

# **Routing Messages**

- Shared Media
  - Broadcast to everyone


- Options:
  - Source-based routing: message specifies path to the destination (changes of direction)
  - Destination-based routing: message specifies destination, switch must pick the path
    - deterministic: always follow same path
    - adaptive: pick different paths to avoid congestion, failures
    - randomized routing: pick between several good paths to balance network load

#### **Deadlock Freedom**



#### Store and Forward vs. Cut-Through

- Store-and-forward
  - each switch waits for the full packet to arrive in switch before sending to the next switch
  - Applications: LAN or WAN
- Cut-through routing
  - switch examines the header, decides where to send the message
  - starts forwarding it immediately

#### Store&Forward vs Cut-Through Routing

**Deterministic Routing Examples** 



# Cut-Through vs. Wormhole Routing

- In wormhole routing, when the head of a message is blocked, the message stays strung out over the network, potentially blocking other messages (each switch needs to buffer only the piece of the packet that is sent between switches).
- Cut-through routing lets the tail continue when the head is blocked, accordioning the whole message into a single switch (requires a buffer large enough to hold the largest packet).
- References
  - P. Kermani and L. Kleinrock, "Virtual cut-through: A new computer communication switching technique," *Computer Networks*, vol. 3, pp. 267-286, 1979.
  - W.J. Dally and C.L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," *IEEE Trans. Computers*, vol. C-36, no. 5, pp. 547-553, May 1987.

#### Contention



- Two packets trying to use the same link at same time
  - limited buffering
  - drop?
- Most parallel machine networks block in place
  - link-level flow control
  - tree saturation
- Closed system: offered load depends on delivered load


#### Flow Control

- What do you do when push comes to shove?
  - ethernet: collision detection and retry after delay
  - FDDI, token ring: arbitration token
  - TCP/WAN: buffer, drop, adjust rate
  - any solution must adjust to output rate
- Link-level flow control



# Smoothing the flow



How much slack do you need to maximize bandwidth?


#### Virtual Channels



 W.J. Dally, "Virtual-Channel Flow Control," Proceedings of the 17th annual international symposium on Computer Architecture, p.60-68, May 28-31, 1990

#### Bandwidth

- What affects local bandwidth?
  - packet density: $b \times n/(n + n_e)$
  - routing delay: $b \times n/(n + n_e + w\Delta)$
  - contention
    - endpoints
    - within the network
- Aggregate bandwidth
  - bisection bandwidth
    - sum of bandwidth of smallest set of links that partition the network
  - total bandwidth of all the channels: $Cb$
  - suppose N hosts each issue a packet every M cycles with average distance h
    - each message occupies h channels for $l = n/w$ cycles each
    - C/N channels available per node
    - link utilization $\rho = Nhl/(MC) < 1$
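The bandwidth expressions can be evaluated directly. In this sketch b is the raw link bandwidth, n the payload size, n_e the header/trailer envelope, w the channel width, and delta the routing delay in cycles; the function names and the example numbers are assumptions. Utilization is computed as demand over supply: N hosts each consume h * l channel-cycles every M cycles, against C channels.

```python
def effective_bandwidth(b, n, n_e, w=0, delta=0):
    """Payload bandwidth after envelope and routing-delay overhead:
    b * n / (n + n_e + w * delta)."""
    return b * n / (n + n_e + w * delta)


def link_utilization(n_hosts, m_cycles, hops, n_bits, width, channels):
    """Fraction of channel-cycles in use; must stay below 1."""
    l = n_bits / width                      # cycles a message holds each channel
    return n_hosts * hops * l / (m_cycles * channels)


# Example numbers (hypothetical): 400 MB/s link, 128-byte payload, 32-byte envelope.
density_only = effective_bandwidth(400, 128, 32)
with_routing = effective_bandwidth(400, 128, 32, w=16, delta=2)

# 64 hosts, one 160-bit packet every 100 cycles, 4 hops, 16-bit channels, 256 channels.
rho = link_utilization(64, 100, 4, 160, 16, 256)
```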

#### Some Examples

- T3D: Short, Wide, Synchronous (300 MB/s)
  - 24 bits
    - 16 data, 4 control, 4 reverse direction flow control
  - single 150 MHz clock (including processor)
  - flit = phit = 16 bits
  - two control bits identify flit type (idle and framing)
    - no-info, routing tag, packet, end-of-packet
- T3E: long, wide, asynchronous (500 MB/s)
  - 14 bits, 375 MHz
  - flit = 5 phits = 70 bits
    - 64 bits data + 6 control
  - switches operate at 75 MHz
  - framed into 1-word and 8-word read/write request packets
- Cost = f(length, width) ?

# Switches



- With virtual channels, a buffer becomes multiple buffers
- Who selects a virtual channel?


#### Switch Components

- Output ports
  - transmitter (typically drives clock and data)
- Input ports
  - synchronizer aligns data signal with local clock domain
  - essentially FIFO buffer
- Crossbar
  - connects each input to any output
  - degree limited by area or pinout
- Buffering
- Control logic
  - complexity depends on routing logic and scheduling algorithm
  - determine output port for each incoming packet
  - arbitrate among inputs directed at same output

#### Bluegene/L: A Low Power Design




#### **Comparing Systems**

|                        | ASCI<br>White | ASCI Q | Earth<br>Simulator | Blue<br>Gene/L |
|------------------------|---------------|--------|--------------------|----------------|
| Machine<br>Peak (TF/s) | 12.3          | 30     | 40.96              | 367            |
| Total Mem.<br>(TBytes) | 8             | 33     | 10                 | 32             |
| Footprint<br>(sq ft)   | 10,000        | 20,000 | 34,000             | 2,500          |
| Power (MW)             | 1             | 3.8    | 6-8.5              | 1.5            |
| Cost (\$M)             | 100           | 200    | 400                | 100            |
| # Nodes                | 512           | 4096   | 640                | 65,536         |
| MHz                    | 375           | 1000   | 500                | 700            |

# Supercomputer Peak Performance



#### BlueGene/L Compute ASIC







- 1.5/2.5 Volt






#### BlueGene/L Interconnection Networks

**3-Dimensional Torus**

- Interconnects all compute nodes (65,536)





#### Bluegene/L: 16384 nodes (IBM Rochester)



#### Summary

- Scalable multicomputers must consider scaling issues in bandwidth, latency, power and cost
- Network interface design
  - Substantially reduce the send and receive overheads
- Networks need to support programming models and applications well
- Many network topologies have been studied; the most common are meshes, tori, trees, and multistage networks
- Current network routers use virtual cut-through or wormhole routing with virtual channels
- New-generation scalable computers must consider power scaling