2. A directed data-flow architecture has only one locus of control but simulates multiple loci of control by using predicated instructions and having the compiler do code motion between basic blocks.
3. Directed data-flow extends VLIW by adding conditional scheduling
(predication) and a context register matrix for conflict-free functional
unit register use.
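As a rough illustration of the predication idea in points 2 and 3, the sketch below replaces a branch with two predicated operations that sit in one straight-line instruction stream. It is a toy model, not Cydra 5 code: the names run_predicated and abs_diff, and the representation of an operation as a (predicate, destination, function) tuple, are assumptions made for this example.

```python
# Illustrative sketch of if-conversion (hypothetical names, not Cydra 5 code).
# An operation is (predicate_name, dest, function). It commits its result only
# when its predicate is true, so both arms of a branch share one stream.

def run_predicated(ops, regs, preds):
    """Execute every op in order, squashing those whose predicate is false."""
    for pred, dest, fn in ops:
        if preds[pred]:
            regs[dest] = fn(regs)
    return regs


def abs_diff(a, b):
    """Compute |a - b| with predicated ops instead of a branch."""
    regs = {"a": a, "b": b, "r": 0}
    # The compare sets a predicate and its complement.
    preds = {"p": a >= b, "not_p": a < b}
    ops = [
        ("p", "r", lambda r: r["a"] - r["b"]),      # former then-arm
        ("not_p", "r", lambda r: r["b"] - r["a"]),  # former else-arm
    ]
    return run_predicated(ops, regs, preds)["r"]
```

Both arms are always issued; the predicate decides which one takes effect. That is what lets a single locus of control stand in for several.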
The Cydra 5 is engineered primarily for loop-intensive numeric calculations.
It can outperform vector processors because it can allocate "iteration
frames": each iteration of a loop can have its own set
of registers. This allows one iteration to access values calculated in
other iterations, since those register values won’t be overwritten. So the Cydra
5 should outperform a vector processor when there are recurrences in loops
and should perform as well as a vector processor when there are no recurrences.
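One way to picture iteration frames is as a register file addressed relative to a base pointer that moves once per iteration. The sketch below is a simplified model under that assumption: the class RotatingFile, its method names, and the file size of 8 are invented for illustration and are not taken from the Cydra 5 ISA. It evaluates the recurrence x[i] = x[i-1] + x[i-2], where each iteration reads values that earlier iterations wrote into their own frames.

```python
# Simplified model of iteration frames (hypothetical names and sizes).
class RotatingFile:
    def __init__(self, size):
        self.phys = [0] * size   # physical registers
        self.base = 0            # frame pointer for the current iteration

    def _index(self, logical):
        # Logical register numbers are offsets from the current frame base.
        return (self.base + logical) % len(self.phys)

    def read(self, logical):
        return self.phys[self._index(logical)]

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def new_iteration(self):
        # Moving the base makes the previous iteration's logical register 0
        # visible as logical register 1, and so on.
        self.base = (self.base - 1) % len(self.phys)


def fib_recurrence(n, size=8):
    """Return x[n] for x[0]=0, x[1]=1, x[i]=x[i-1]+x[i-2], with n >= 1."""
    rf = RotatingFile(size)
    rf.write(0, 1)  # x[1] lives in the current frame
    rf.write(1, 0)  # x[0] lives in the frame one iteration back
    for _ in range(2, n + 1):
        rf.new_iteration()
        rf.write(0, rf.read(1) + rf.read(2))
    return rf.read(0)
```

Every iteration uses the same logical register numbers (0, 1, 2), yet no iteration overwrites a value that a later iteration still needs, provided the file holds more frames than the recurrence distance.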
The Cydra 5 has a large number of registers (1 register file per functional
unit plus a general-purpose register file and an iteration control register
file). Since each register file has 64 registers (except ICR with 128)
we can see there are 448 operand registers (6 functional units, 1 GPR:
7x64 registers), and 128 predication registers (ICR). From figure 7 we
can see that all of the operand registers are exposed by the ISA. The source
operands have 9 bits (6 bits would allow access to 64 registers, and 3
additional bits can specify which register file) so 2^9 or 512 potential
registers can be accessed. Similarly, the destination registers need
7 bits (6 bits for the 64 registers and 1 bit to select either the GPR or the functional
unit register file).
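The register counts and field widths above can be checked with a few lines of arithmetic. In the sketch below the numbers come from the text, while the bit layout used by decode_source (file select in the high 3 bits, register index in the low 6 bits) is an assumption for illustration; the text gives only the field widths, not their order.

```python
REGS_PER_FILE = 64
OPERAND_FILES = 7    # 6 functional-unit files + 1 GPR file
ICR_REGS = 128       # predicate registers in the ICR file


def bits_needed(n):
    """Smallest number of bits that can distinguish n items."""
    return (n - 1).bit_length()


operand_registers = OPERAND_FILES * REGS_PER_FILE  # 7 x 64
index_bits = bits_needed(REGS_PER_FILE)            # selects a register in a file
file_bits = bits_needed(OPERAND_FILES)             # selects one of 7 files
source_bits = index_bits + file_bits
addressable = 2 ** source_bits


def decode_source(spec):
    """Split a source specifier into (file, index).

    Assumed layout: file select in the high bits, index in the low bits.
    """
    if not 0 <= spec < addressable:
        raise ValueError("specifier out of range")
    return spec >> index_bits, spec & (REGS_PER_FILE - 1)
```

With these values, 9 bits address 512 slots while only 448 operand registers exist, so one of the eight file-select codes does not correspond to any of the seven operand files counted above.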
The Cydra 5 has a pseudo-randomly interleaved memory architecture with no cache.
The cache is omitted on the assumption that numeric calculations may walk
through an entire array in memory before returning to the beginning again,
so a cache would add no benefit here.
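A commonly cited motivation for hashing addresses onto banks, rather than using the low address bits directly, is that strided array walks otherwise hit the same bank repeatedly. The sketch below contrasts the two under stated assumptions: 8 banks and a toy XOR hash. The hash is a stand-in chosen for illustration, not the function the Cydra 5 uses.

```python
NUM_BANKS = 8   # assumed power of two, chosen only for illustration
BANK_BITS = 3   # log2(NUM_BANKS)


def modulo_bank(addr):
    """Conventional interleaving: the low address bits select the bank."""
    return addr % NUM_BANKS


def hashed_bank(addr):
    """Toy hash (an assumption): XOR the two lowest 3-bit address fields."""
    return (addr ^ (addr >> BANK_BITS)) % NUM_BANKS


def banks_touched(bank_of, stride, count):
    """Number of distinct banks hit by `count` accesses with a fixed stride."""
    return len({bank_of(i * stride) for i in range(count)})
```

With a stride equal to the bank count, modulo interleaving sends every access to one bank, while the hashed mapping spreads the same accesses over all eight. Any fixed hash still has some unfavorable strides; the goal is only to make them less likely to coincide with common array layouts.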
It seems that the Cydra 5 would have a poor cost/performance ratio
for scalar computation. Since there is no cache, memory access will be
slow. Most of the 448 general registers would be unused in scalar calculations.
If scheduling couldn’t effectively fill multiop instructions, then uniops
might be used, and this would preclude the use of predication (the ISA
doesn’t expose predicate registers to uniops). So most of the benefit of
the Cydra 5 comes from loops with recurrences.
Questions:
1. Since the hardware allocates iteration frames from the register
files, and yet the ISA exposes all of the registers, it seems like there
could be a conflict. In the extreme case, imagine the compiler uses all
448 registers in every iteration of a loop with recurrences. How, then, can
the hardware allocate iteration frames, and how will it know which registers
it can’t use for an iteration frame?
2. I wonder if anyone has measured the average amount of parallelism that exists in a program using the data-flow model and assuming no hardware restrictions. The upper bound occurs when there are no dependencies between instructions: the entire program could then be issued in one cycle, giving O(n) parallelism (n being the number of instructions executed in the program).
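One way to make question 2 concrete is to define average parallelism as the instruction count divided by the length of the longest dependency chain. The sketch below computes that ratio for a small dependency graph. It assumes unit latency for every instruction and unlimited hardware, and the function name average_parallelism and the dictionary representation are invented for this example.

```python
def average_parallelism(deps):
    """Instructions divided by critical-path length.

    `deps` maps each instruction id to the ids it depends on and must
    describe a non-empty acyclic graph. Unit latency is assumed.
    """
    depth = {}

    def level(node):
        # Earliest cycle in which `node` can issue: one more than its
        # latest-finishing dependency.
        if node not in depth:
            depth[node] = 1 + max((level(d) for d in deps[node]), default=0)
        return depth[node]

    critical_path = max(level(n) for n in deps)
    return len(deps) / critical_path
```

For n instructions with no dependencies the critical path is one cycle and the ratio is n, matching the upper bound in the question. A fully serial chain gives a ratio of 1.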
[1] B. R. Rau, D. W. L. Yen, W. Yen, R. A. Towle, "The Cydra 5 Departmental Supercomputer - Design Philosophies, Decisions, and Trade-offs", IEEE Computer, 22(1):12-35, January 1989.