Observation 5:

The Cydra 5 Departmental Supercomputer [1]

The Cydra 5 is a directed data-flow machine engineered for loop-intensive numeric processing.

Distinctions:
1. A data-flow architecture allows an operation to be executed as soon as its data and control dependencies are satisfied. This means there may be multiple loci of control active simultaneously.

2. A directed data-flow architecture has only one locus of control but simulates multiple loci of control by using predicated instructions and having the compiler do code motion between basic blocks.

3. Directed data-flow extends VLIW by adding conditional scheduling (predication) and a context register matrix that gives the functional units conflict-free register access.
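The effect of predication in distinctions 2 and 3 can be sketched in a few lines. This is only an illustration of if-conversion in general, not Cydra 5 code: the function names and the `commit` helper are hypothetical, and Python's sequential execution stands in for operations that the hardware would issue side by side.

```python
def branchy(values):
    """Two control paths: the form a compiler starts from."""
    pos_sum, neg_count = 0, 0
    for v in values:
        if v > 0:
            pos_sum += v
        else:
            neg_count += 1
    return pos_sum, neg_count


def commit(pred, new, old):
    """Hypothetical helper: a predicated op keeps the old value when its guard is false."""
    return new if pred else old


def predicated(values):
    """Single locus of control: both operations are issued every iteration."""
    pos_sum, neg_count = 0, 0
    for v in values:
        p = v > 0  # the compare writes a predicate instead of branching
        pos_sum = commit(p, pos_sum + v, pos_sum)
        neg_count = commit(not p, neg_count + 1, neg_count)
    return pos_sum, neg_count
```

Both functions compute the same result; the second has no branch inside the loop body, which is what lets a compiler schedule operations from both paths into the same wide instruction.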

The Cydra 5 is heavily engineered toward loop-intensive numeric calculations. It can outperform vector processors because it can allocate "iteration frames": each iteration of a loop can have its own set of registers. This allows one iteration to access values calculated in other iterations, since those register values won’t be overwritten. So the Cydra 5 will outperform a vector processor when loops contain recurrences, and it should perform as well as a vector processor when they do not.
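A minimal sketch of the iteration-frame idea follows, assuming a register file addressed relative to a frame pointer that moves by one register per iteration. The class name, the method names, the direction of the pointer movement, and the file size are assumptions made for illustration, not details taken from these notes.

```python
class RotatingFile:
    """Hypothetical register file addressed relative to a per-iteration base."""

    def __init__(self, size=64):
        self.regs = [0] * size
        self.base = 0

    def _phys(self, r):
        return (self.base + r) % len(self.regs)

    def read(self, r):
        return self.regs[self._phys(r)]

    def write(self, r, value):
        self.regs[self._phys(r)] = value

    def next_iteration(self):
        # After this call, what was logical register r is logical register r + 1.
        self.base = (self.base - 1) % len(self.regs)


def recurrence(b, x0, x1):
    """Compute x[i] = x[i-1] + x[i-2] + b[i] using only logical registers 0..2."""
    rf = RotatingFile()
    rf.write(1, x0)
    rf.write(0, x1)
    out = []
    for bi in b:
        rf.next_iteration()  # x[i-1] is now register 1, x[i-2] is register 2
        rf.write(0, rf.read(1) + rf.read(2) + bi)
        out.append(rf.read(0))
    return out
```

Every iteration writes "register 0", yet no iteration overwrites the value its successors still need, because the same logical name maps to a different physical register each time.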

The Cydra 5 has a large number of registers: one register file per functional unit, plus a general-purpose register (GPR) file and an iteration control register (ICR) file. Since each register file has 64 registers (except the ICR file, which has 128), there are 448 operand registers (6 functional units plus 1 GPR file: 7 x 64 registers) and 128 predicate registers (ICR). From figure 7 we can see that all of the operand registers are exposed by the ISA. A source operand specifier has 9 bits (6 bits select one of 64 registers, and 3 additional bits select the register file), so 2^9 = 512 potential registers can be addressed. Similarly, a destination specifier needs 7 bits (6 bits for the 64 registers and 1 bit to choose between the GPR file and the functional unit's own register file).
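The register arithmetic above can be checked mechanically. The constants below are the figures quoted in this paragraph; the variable names and the `bits_for` helper are mine.

```python
REGS_PER_FILE = 64
FUNCTIONAL_UNIT_FILES = 6
GPR_FILES = 1
ICR_REGS = 128


def bits_for(choices):
    """Smallest number of bits that can distinguish `choices` alternatives."""
    return (choices - 1).bit_length()


operand_files = FUNCTIONAL_UNIT_FILES + GPR_FILES
operand_regs = operand_files * REGS_PER_FILE

# Source specifier: register index plus a register-file selector.
source_bits = bits_for(REGS_PER_FILE) + bits_for(operand_files)
# Destination specifier: register index plus a GPR/own-file selector.
dest_bits = bits_for(REGS_PER_FILE) + bits_for(2)

print(operand_regs, source_bits, 2 ** source_bits, dest_bits)
```

Note that 9 source bits can name 512 registers while only 448 exist, so one of the eight file-selector encodings does not correspond to an operand register file.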

The Cydra 5 has a pseudo-randomly interleaved memory architecture with no cache. The cache is omitted on the assumption that numeric code may walk through an entire array in memory before returning to its beginning, in which case a cache would add no benefit.
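To see why a scrambled bank mapping helps an array walk, compare plain modulo interleaving with a hashed mapping under a stride equal to the number of banks. The Cydra 5's actual mapping function is not given in these notes; the XOR-fold below is a stand-in chosen only for illustration, and the bank count of 8 is likewise an assumption.

```python
from collections import Counter


def modulo_bank(addr, bank_bits=3):
    """Conventional interleaving: the low address bits select the bank."""
    return addr & ((1 << bank_bits) - 1)


def xor_fold_bank(addr, bank_bits=3):
    """Illustrative hash: XOR together successive bank_bits-wide address fields."""
    mask = (1 << bank_bits) - 1
    bank = 0
    while addr:
        bank ^= addr & mask
        addr >>= bank_bits
    return bank


def max_bank_load(mapping, stride, count=64):
    """Largest number of accesses that land on any single bank."""
    loads = Counter(mapping(i * stride) for i in range(count))
    return max(loads.values())


for stride in (1, 8):
    print(stride, max_bank_load(modulo_bank, stride), max_bank_load(xor_fold_bank, stride))
```

With a stride of 8, modulo interleaving sends all 64 accesses to one bank, while the hashed mapping spreads them evenly across the 8 banks. A simple hash like this one still has unfavorable strides; the point is only that they no longer coincide with the common power-of-two cases.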

It seems that the Cydra 5 would have a poor cost/performance ratio for scalar computation. Since there is no cache, memory access will be slow. Most of the 448 operand registers would go unused in scalar calculations. If the scheduler couldn’t effectively fill multiop instructions, then uniops might be used instead, and this would preclude the use of predication (the ISA doesn’t expose predicate registers to uniops). So most of the benefit of the Cydra 5 comes from loops with recurrences.

Questions:
1. Since the hardware allocates iteration frames from the register files, and yet the ISA exposes all of the registers, it seems like there could be a conflict. In the extreme case, imagine the compiler uses all 448 registers in every iteration of a loop with recurrences. How can the hardware then allocate iteration frames, and how will it know which registers it can’t use for an iteration frame?

2. I wonder if anyone has measured the average amount of parallelism that exists in a program using the data-flow model and assuming no hardware restrictions. The upper bound occurs when there are no dependencies between instructions: the entire program could then be issued in one cycle, giving O(n) parallelism (n being the number of instructions executed in the program).
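One way to make question 2 concrete is to define average parallelism as the number of executed operations divided by the length of the longest dependency chain. The sketch below assumes the executed operations are given as a dict from each operation to the operations it depends on, listed in execution order; that representation and the function name are hypothetical.

```python
def average_parallelism(deps):
    """Operations executed divided by critical-path length, with unlimited hardware.

    `deps` maps each operation to the operations it depends on and must list
    every operation after all of its predecessors.
    """
    depth = {}
    for op, preds in deps.items():
        # An operation can fire one step after its latest predecessor.
        depth[op] = 1 + max((depth[p] for p in preds), default=0)
    critical_path = max(depth.values(), default=0)
    return len(deps) / critical_path if critical_path else 0.0


independent = {i: [] for i in range(8)}
chain = {0: [], 1: [0], 2: [1], 3: [2]}
print(average_parallelism(independent), average_parallelism(chain))
```

The fully independent case gives n (here 8.0), matching the upper bound in the question, and a pure dependency chain gives 1.0, the lower bound.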

[1] B. R. Rau, D. W. L. Yen, W. Yen, R. A. Towle, "The Cydra 5 Departmental Supercomputer - Design Philosophies, Decisions, and Trade-offs", IEEE Computer, 22(1):12-35, January 1989.