# Optimization of One-Bit Full Adders Embedded in Regular Structures

KAZUO IWANO AND KENNETH STEIGLITZ, FELLOW, IEEE

Abstract-We study the problem of optimizing the transistor sizes in the one-bit nMOS full adder either isolated or embedded in a regular array. A local optimization method that we call the critical-path optimization method is developed. In this method, two parameters at a time are changed along the critical path until a locally optimal choice of transistor sizes is found. The critical-path optimization method uses the Berkeley VLSI tools and the hierarchical layout language AL-LENDE developed at Princeton. First, we optimize the isolated one-bit full adder implemented in three ways: as a PLA, data selector, and with random logic. The details of the critical-path optimization method and power-time tradeoff curves are illustrated here. Second, we optimize the one-bit full adder embedded in a simple array multiplier. The entire  $3 \times 3$ ,  $4 \times 4$ ,  $8 \times 8$ , and  $10 \times 10$  multipliers are optimized and their local optima are compared. Because the optimization of the entire circuit becomes less practical when the circuit becomes larger, we develop a method that makes use of circuit regularity. We prove that some small array of one-bit full adders, called the canonical configuration, has the same local optima as the  $n \times n$  multiplier for large n, with the criterion of minimizing the delay time T. Hence, we can greatly reduce the computation load by optimizing this canonical configuration instead of optimizing the entire circuit. Experimental results confirm the effectiveness of this approach.

#### I. INTRODUCTION

**R**EGULAR arrays of cells are used often in custom chips for digital signal processing. Such regular arrays lead to designs that are easy to lay out efficiently and have high throughput. For example, bit-parallel and bitserial multipliers can be constructed from one- and twodimensional arrays of one-bit full adders, as well as a wide variety of pipelined FIR and IIR filters (see, for example, [1]-[5], [9], [12], [16], [25]). This paper is aimed at the problem of optimizing such large arrays. The technology used throughout is 4  $\mu$  nMOS, but the general approach described is applicable to other technologies as well.

We will develop what we call the *critical-path optimization method*. This is a heuristic method for finding a locally optimal choice of transistor sizes by using systematic variation of the parameters along the critical path. The optimization loop uses the Berkeley tools [15] CRYSTAL (for timing) [19], POWEST (for estimating the power consumption), and the constraint-based high-

The authors are with the Department of Computer Science, Princeton University, Princeton, NJ 08544.

IEEE Log Number 8609627.

level layout language ALLENDE, developed at Princeton [11], [13]. We illustrate the critical-path optimization method and resulting power-time tradeoff curves by optimizing an isolated one-bit full adder implemented in three topologies, namely, the PLA, data selector, and with random logic.

Investigating the optimization of the one-bit full adder embedded in an array multiplier, we will next study how to take advantage of a circuit's regularity to reduce the optimization workload. Here we develop the *canonical configuration method*, which is shown to be practical for optimizing large regular structures. We will analyze the critical paths of the  $n \times n$  multiplier using a finite automaton and extract its *canonical configuration*. Then we will prove that the optimization of this *canonical configuration* provides, in the limit of large n, the same local optima as the  $n \times n$  multiplier.

The problem of optimizing the transistor sizes has been studied by several authors. ANDY, developed by Trimberger [17], sizes transistors in a symbolic description of a chip to match the load the transistors are driving, then performs power optimization off the critical path. Glasser and Hoyte [6] developed what they called macromodels of VLSI circuits and optimize the transistor sizes in a critical path. However, their macromodel is sometimes inaccurate and leads them to errors of as much as 70 percent when compared to the SPICE [15] circuit simulation. Matson [14] improved their macromodels to be more accurate and computationally faster, and used them for nonlinear optimization of transistor sizing. Other related work is reported by Strojwas, Nassif, and Director [18], and Jouppi [23]. Our efforts will concentrate on taking advantage of circuit regularity to make practical the optimization of large arrays.

#### II. CRITICAL-PATH OPTIMIZATION METHOD

The design of VLSI chips often involves the difficult task of effecting tradeoffs among three important measures, that is, the delay time T; the peak or average power dissipation  $P_{max}$  or  $P_{ave}$ , and the area A. There are many circuit choices which can be used for controlling these tradeoffs. For example, we can control the choice of an appropriate topology, use precharging, superbuffers, insert or delete logic stages to control the appropriate fanout factor, etc. However, we will concentrate in this paper on the choice of pulldown diffusion widths, because the

Manuscript received July 30, 1985; revised March 21, 1986. This work was supported in part by the National Science Foundation under Grant ECS-8414674, by the U.S. Army Research-Durham Contract DAAG29-85-K-0191, DARPA Contract N00014-82-K-0549, by the Office of Naval Research under Grant N00014-83-K-0275, and by IBM-Japan.

problem of sizing transistors is important but very tedious work for chip designers, and lends itself well to efficient solution by automated methods.

The constraint-based high-level layout language AL-LENDE [13] enables us to parameterize a circuit; that is, ALLENDE accepts a circuit parameter vector  $\pi$  and produces the layout  $C(\pi)$ . Since the circuit performance (such as the delay time T, the power dissipation P, and the area A, etc.) is determined by the circuit, the vector (P, T, A)can be expressed as a function of C. Since the circuit C is parameterized as  $C(\pi)$ , the vector (P, T, A) can finally be expressed as a function  $g(\pi)$ .

In general, therefore, our optimization can be formalized as follows:

$$\min_{\pi} f(P, T, A) = f(g(\pi))$$

subject to constraints on P, T, and/or A. Here  $f(\cdot)$  is the cost function to be optimized, and  $\pi$  is a circuit parameter vector  $\pi = (d_1, d_2, \dots, d_n)$ . Since we optimize the transistor sizes of pulldowns, we treat each pullup/pulldown pair as a *node*. Typically, each node represents an inverter, NAND, or NOR gate. Each layout is characterized by the parameter vector  $\pi = (d_1, d_2, \dots, d_n)$ , which means that the pulldown diffusion width of node i is  $d_i\lambda$ . We also use the vector  $\kappa = (k_1, k_2, \dots, k_n)$  to mean that the pullup-to-pulldown ratio of node i is  $k_i$  [22]. The vector  $\kappa$  is fixed for each circuit.

The choice of the cost function  $f(\cdot)$  and constraints depends on the design issues. For example, in one application the clock period may be fixed at a known value  $T_0$ , and it would therefore be senseless to make the cell faster. On the other hand, peak power may be a real constraint because of heat dissipation limitations. At the same time, it may be important to keep the area small so as to fit as many cells on one chip as possible. We might, therefore, try to minimize some measure of the peak power and area (the product,  $P_{\text{max}}T$ , for example), while enforcing the constraint  $T \leq T_0$ . In other applications, speed may be critical, and it may be important to minimize T while observing constraints on P and A, and so on. In general, we would like to have enough information about the tradeoffs among the measures P, T, and A to make intelligent design decisions. As we will see, the P-T tradeoff is often of most interest, since the area is often a less sensitive function of design parameters (at least for fixed topology).

# III. IMPLEMENTATION OF THE CRITICAL-PATH OPTIMIZATION

We use a heuristic optimization method based on a critical path, and we will call our optimization method a *critical-path optimization method*. The general concept of the method is shown in Fig. 1. A circuit  $C(\pi)$  is generated based on a circuit parameter  $\pi$ . The cost function f(C) of the circuit  $C(\pi)$  is computed next. Then a desired variation  $\delta(\pi)$  of the parameter vector  $\pi$  is computed, based



Fig. 1. Critical-path optimization.



Fig. 2. Detailed flowchart of the critical-path optimization method.

on the critical path. Finally,  $\delta(\pi)$  is added to  $\pi$  and the new parameter  $(\pi + \delta \pi)$  replaces  $\pi$  if its cost is better. This is repeated until a local optimum is found. Fig. 2 shows a detailed flowchart of our implementation, which









Fig. 3. (a) Circuit diagram of the PLA. (b) Circuit diagram of the data selector. (c) Circuit diagram of the random logic circuit.

uses the tools ALLENDE, MEXTRA, CRYSTAL, POWEST. A short description of each follows below.

1) ALLENDE: This procedural constraint-based VLSI layout language produces an integrated circuit layout in Caltech Intermediate Form (CIF) corresponding to the specified circuit parameter  $\pi$  [14].

2) MEXTRA: MEXTRA reads CIF and extracts the nodes to create a circuit description for further analysis [15].

3) CRYSTAL: CRYSTAL is used for finding the critical path and the delay time of the circuit [15], [19].

4) POWEST: POWEST is used for finding the average and maximum power consumption of the circuit [15].

The basic approach we take will be to search for local improvements from random initial designs. The search strategy will be to consider all single or double changes of the current parameter vector  $\pi$  along the critical path. The idea is that the critical path indicates which parame-

ters are most important to performance at any given point in the analysis. The "k-change" method is described below in general, with simple and double change corresponding to k = 1 and 2.

Given a current parameter vector  $\pi = (d_1, d_2, \dots, d_n)$ and the critical path nodes which are on a critical path, say,  $cpn = (d_{i_1}, d_{i_2}, \dots, d_{i_m})$ , the "k-change" method picks k nodes from cpn, say,  $d_{j_1}, d_{j_2}, \dots$ , and  $d_{j_k}$ , then changes each of  $d_{j_l}, 1 \le l \le k$  by one unit and keeps the others the same. For example, "2-change" produces  $\binom{m}{2}$  $\times 2^2 = 2m(m-1)$  sets of parameters. Then each parameter is analyzed in a fixed order. When the first cost improvement is met, the current parameter is picked as the parameter for the next iteration. That is, the first improvement found is adopted.

#### IV. FULL-ADDER CIRCUIT IMPLEMENTATIONS

We used the one-bit full-adder circuit as an example for experiments because it is relatively simple, but is a basic arithmetic logic circuit and has a wide variety of uses. The one-bit fill-adder circuit can be implemented in many ways. We chose three kinds of circuits: the PLA, data selector, and random logic. All implemented circuits have been verified by ESIM [15] or SIMULATE [13].

1) PLA: Fig. 3(a) shows the full-adder circuit diagram implemented by a programmable logic array (PLA). The  $\pi$  and  $\kappa$  of the PLA are as follows:

$$\pi = (d_{and_1}, \cdots, d_{and_7}, d_{or_1}, d_{or_2}, d_{in_1}, d_{it_{12}}, d_{in_3}, d_{out_1}, d_{out_2})$$

2) Data Selector: Fig. 3(b) shows the full-adder circuit diagram of a data selector implementation [20]. The following truth table is used:

| C <sub>i</sub> | S <sub>i</sub> | So | <i>C<sub>o</sub></i> |
|----------------|----------------|----|----------------------|
| 0              | 0              | A  | $C_i$ (or $S_i$ )    |
| 0              | 1              | A  | A                    |
| 1              | 0              | Ā  | A                    |
| 1              | 1              | A  | $C_i$ (or $S_i$ )    |

This circuit selects inputs  $(A, \overline{A}, \text{ or } C_i)$  instead of calculating  $S_0$  and  $C_0$ . Here  $C_i$  is the input carry signal,  $C_0$ is the output carry signal, and  $S_0$  is the output sum signal. A and  $S_i$  denote the two other inputs. This layout has the following 7 parameters.

$$\pi = (d_A, d_{S_i}, d_{C_i}, d_1, d_2, d_{C_o}, d_{S_o})$$

 $\kappa = (4, 4, 4, 8, 4, 8, 8).$ 

3) Random Logic: Fig. 3(c) shows the circuit diagram of the random logic implementation [21]. This layout has the following 4 parameters: one node for computing  $\overline{carry}$ , one for  $\overline{sum}$ , one for carry, and one for sum.

 $\pi = (d_{\overline{C_o}}, d_{\overline{S_o}}, d_{C_o}, d_{S_o})$  $\kappa = (8, 12, 4, 4).$ 

The following logical equations describe the circuit:

TABLE I Performance Comparison (One-Bit Full Adder)

| Туре          | A    | $P_{ave}$ | $P_{\rm max}$ | T    | APT  | PT   |
|---------------|------|-----------|---------------|------|------|------|
| PLA           | 21.2 | 7.2       | 10.1          | 9.7  | 2074 | 980  |
|               | 21.3 | 7.5       | 10.6          | 9.7  | 2185 | 1026 |
| Data Selector | 8.2  | 3.3       | 5.2           | 14.7 | 628  | 760  |
|               | 8.4  | 3.5       | 5.6           | 14.7 | 699  | 830  |
| Random Logic  | 9.3  | 1.7       | 2.4           | 7.2  | 161  | 173  |
|               | 9.3  | 1.8       | 2.6           | 7.3  | 179  | 197  |

$$C_o = A \cdot (S_i + C_i) + C_i \cdot S_i,$$

$$S_o = C_o \cdot (A + S_i + C_i) + C_i \cdot S_i \cdot A.$$

### V. COMPUTATIONAL RESULTS OF THE FULL-ADDER CIRCUITS OPTIMIZATION

Table I shows a comparison of the performance of our implementations. Each row represents one point locally optimal with respect to T. The units of A,  $P_{\text{ave}}$ ,  $P_{\text{max}}$ , T, APT, and PT are  $10^3 \cdot \lambda^2$ , (mW), (mW), ns,  $(10^{-12} \cdot \lambda^2 \cdot W \cdot ns)$ , and  $(10^{-8} \cdot W \cdot ns)$ , respectively, in all tables.

Fig. 4 shows  $P_{\text{max}}$  versus T curves of the one-bit full adder for different topologies when minimizing T. These  $P_{\text{max}}$  versus T tradeoff curves are obtained as follows. During a critical-path optimization process, every time a current parameter  $\pi$  shows an improvement in cost (in this case, T), we plot the associated values of  $P_{\text{max}}(\pi)$  and  $T(\pi)$ . We then obtain a trajectory from each initial random starting point to a locally optimal point. By drawing an envelope of these points on many trajectories, we finally obtain an approximate  $P_{\text{max}}$  versus T tradeoff curve.

As shown in Fig. 4, the  $P_{\text{max}}$  versus T tradeoff curve of the random logic circuit is below that of the data selector circuit and that of the PLA.

Table II shows a normalized performance comparison of the best locally optimal point for each layout, minimizing T. The random logic seems to be the best choice in all respects except A. The product  $P_{\text{max}} T$  of the random logic is about 1/4.4 that of the data selector, while it is about 1/5.7 that of the PLA.

Table III is the performance comparison table between the *T*-locally optimal circuit and the circuit designed using the minimum sizes  $(2\lambda)$ . Our optimization shows a good improvement of the delay time (improvement from 55 to 73 percent) in any implementation. Matson optimized the same random logic circuit and obtained a delay time of 8.0 ns in [14], while our locally optimal circuit has 7.2 ns, providing an independent check of the effectiveness of our optimization method.

VI. Full Optimization of  $n \times n$  Multipliers

In this section, we take up the problem of optimizing the one-bit full adder when it is embedded in a regular array, using the array multiplier illustrated in Fig. 5(a) for 4 bits.

Complete layouts of the  $3 \times 3$ ,  $4 \times 4$ ,  $8 \times 8$ , and  $10 \times 10$  multipliers were optimized using the critical-path optimization method. Fig. 6 shows the possible tradeoff



Fig. 4. The P-T tradeoff curves of one-bit full-adder circuits.

 TABLE II

 Normalized Performance Comparison (One-Bit Full Adder)

| Туре          | A   | Pave | P <sub>max</sub> | Т   | APT  | PT  |
|---------------|-----|------|------------------|-----|------|-----|
| PLA           | 228 | 424  | 421              | 135 | 1288 | 566 |
| Data Selector | 88  | 194  | 217              | 204 | 390  | 439 |
| Random Logic  | 100 | 100  | 100              | 100 | 100  | 100 |

TABLE III Performance Improvement Ratio

| Туре          | Cost | PLA           | Data<br>Selector | Random<br>Logic |
|---------------|------|---------------|------------------|-----------------|
| min size (ns) | T    | 21.7          | 53.2             | 26.9            |
| T-opt (ns)    | T    | 9.7           | 14.7             | 7.2             |
| improvement   | T    | -55.2 percent | ~72.3 percent    | -73.2 percent   |

of power against delay time obtained in the same way as Fig. 4.

Table IV gives the results of starting from 11 different initial parameter vectors. They yield only four distinct local optima, corresponding to the parameters  $\pi_1 = (4, 16, 8, 8, 8), \pi_2 = (8, 16, 8, 8, 8), \pi_3 = (12, 12, 8, 8, 8),$  and  $\pi_4 = (12, 16, 8, 8, 8)$ . Note that in Table IV, \*\* indicates that the associated  $\pi$  is not a local optimum.

As we can see in Table V, the running time of this optimization method increases quickly with the size of the array, growing approximately as the number of basic cells in the circuit, or  $n^2$  for an  $n \times n$  multiplier. This means that optimizing the entire circuit at once is a very costly operation, practical for a relatively small circuit, but not





(c) Fig. 5. (a)  $4 \times 4$  multipler. (b) One-bit full-adder cells. (c) AND cell.

for large circuits. We will see later how to take advantage of the circuit regularity to reduce the computational workload.

In the next section we will define a class of circuits consisting of rectangular arrays of one-bit full adders. In succeeding sections, we will analyze the critical paths of their circuits, and show that they can be constructed from information obtained by optimization of a small "representative" circuit, called a *canonical configuration*. We will then prove that this optimization also yields the same locally optimal one-bit full adders as direct optimization of large circuits. Finally, we give experimental results which confirm this fact, and show the utility of this approach.





 TABLE IV

 Local Optima (\*\* Means That This is Not a Local Optimum)

| Туре           | Cost          | $\pi_1$ | $\pi_2$ | $\pi_3$   | $\pi_4$   |
|----------------|---------------|---------|---------|-----------|-----------|
| 3 × 3          | Т             | 78.9    | 78.6    | 74.1      | 80.7(**)  |
|                | $P_{max}$     | 30.1    | 32.9    | 32.9      | 32.9      |
|                | Para          | 18.4    | 20.5    | 20.5      | 20.5      |
|                | A             | 82.7    | 84.0    | 82.4      | 87.9      |
| $4 \times 4$   | Т             | 118.9   | 118.1   | 110.2     | 120.3(**) |
|                | $P_{\rm max}$ | 54.6    | 60.2    | 60.2      | 60.2      |
|                | Pave          | 33.4    | 37.6    | 37.6      | 37.6      |
|                | A             | 152.3   | 154.7   | 151.5     | 162.0     |
| $8 \times 8$   | Т             | 279.6   | 277.1   | 377.9(**) | 283.5     |
|                | Pmax          | 224.7   | 251.0   | 224.6     | 251.0     |
|                | Pave          | 138.4   | 158.1   | 138.3     | 158.1     |
|                | A             | 636.2   | 646.6   | 632.9     | 678.8     |
| $10 \times 10$ | Т             | 359.2   | 355.9   | 487.2(**) | 364.1     |
|                | $P_{\rm max}$ | 353.0   | 395.3   | 352.9     | 395.3     |
|                | Pave          | 217.7   | 249.4   | 217.6     | 249.4     |
|                | A             | 1001.7  | 1018.1  | 996.5     | 1069.3    |

TABLE V Optimization Computing Cost

|                                  |       |             | -           |               |
|----------------------------------|-------|-------------|-------------|---------------|
|                                  | m33   | <i>m</i> 44 | <i>m</i> 88 | <i>m</i> 1010 |
| Number (basic cells)             | 6     | 12          | 56          | 90            |
| CPU (one point)                  | 2 min | 3 min       | 14 min      | 22 min        |
| Average number of (iterations)   | 5.0   | 5.1         | 5.0         | 6.0           |
| Average number (points searched) | 61    | 84          | 61          | 100           |

#### VII. A CLASS OF CIRCUITS

We discuss a class of circuits consisting of one-bit full adders which have four nodes as shown in Fig. 3(c). They are  $N_{\overline{C_o}}$ ,  $N_{C_o}$ ,  $N_{\overline{S_o}}$ , and  $N_{S_o}$ , which yield the *carry*, *carry*, *sum*, and *sum* signals, respectively. Note that there are inputs  $C_i$  and  $S_i$  to the nodes  $N_{\overline{C_o}}$  and  $N_{\overline{S_o}}$ . The relations among nodes are shown in Fig. 7. Next we define a class of circuits as follows.

*Definition:*  $C_{FA} \equiv \{C \mid \text{the circuit } C \text{ has the following three properties.} \}$ 

1) The circuit C is an  $m \times n$  subarray of the two-dimensional infinite array of identical one-bit full adders shown in Fig. 8, for some m and n.

2) The one-bit full adder cell used in the array is the random logic circuit shown in Fig. 3(c). Note that the input  $A = a \cdot b$  is created from two other inputs  $\overline{a}$  and  $\overline{b}$  by a NOR circuit. The one-bit full adder is characterized by the circuit parameter  $\pi = (d_{a \cdot b}, d_{\overline{Co}}, d_{\overline{So}}, d_{Co}, d_{\overline{So}})$ . Since we use identical one-bit full adders in the entire array as mentioned above, we regard the circuit parameter  $\pi$  of the one-bit full adder as the circuit parameter of the entire circuit C.

3) The interconnection scheme of the one-bit full adders is shown in Fig. 9.

In Fig. 8 each cell is a one-bit full adder with coordinates (x, y). We measure x to the left and y down, and  $A_{x,y}$  designates a cell located in position (x, y). Each cell has two inputs  $(S_i:$  the sum input and  $C_i:$  the carry input) and two outputs  $(S_o:$  the sum output and  $C_o:$  the carry output) as shown in Fig. 9. Precisely speaking,  $A_{x,y}$  has two more inputs  $\overline{a}$  and  $\overline{b}$ , as shown in Fig. 5(b), but we do not include these inputs in our model because they will not appear in any critical path. In other words, we assume that two inputs  $\overline{a}$  and  $\overline{b}$  are available to any cell when a critical path reaches that cell. As shown in Fig. 9, the carry output  $C_o$  of  $A_{x,y}$  is propagated to the carry input of  $A_{x+1,y}$ , while the sum output  $S_o$  of  $A_{x,y}$  is propagated to the sum input  $S_i$  of  $A_{x-1,y+1}$ .

Here we analyze an  $n \times n$  multiplier which has *n* columns and (n - 1) rows of full adders. Hence, the  $n \times n$ multiplier corresponds to the rectangle bounded by the corner cells  $A_{0,0}$ ,  $A_{n-1,0}$ ,  $A_{n-1,n-2}$ , and  $A_{0,n-2}$  in Fig. 8.

#### VIII. DEFINITION OF CRITICAL PATHS

In this section we define a critical path between two cells in a circuit  $C \in C_{FA}$  and analyze the behavior of signals on a critical path.

Suppose that a path  $\alpha$  exits the one-bit full adder cell  $A_{x,y}$  from either the carry output  $C_a$  or the sum output  $S_a$ with high or low signal. Hence, in order to identify the state of the path  $\alpha$  at the exit of the cell  $A_{x,y}$ , we can use the representation (A, a) where A is either carry or sum, and a is either high or low. Let (C, 0) denote the state in which a path exits from the carry output with a low signal, and let (C, 1), (S, 0), and (S, 1) be defined analogously. Then the behavior of the path  $\alpha$  can be represented by a sequence of states. For example, the expression  $(C, 0) \rightarrow$  $(C, 0) \rightarrow (S, 1) \rightarrow (S, 0)$  represents the path which exits  $A_{0,0}$  with the low carry signal, exits  $A_{1,0}$  with the low carry signal, exits  $A_{2,0}$  with the high sum signal, and finally exits  $A_{1,1}$  with the low sum signal as shown in Fig. 8. We can thus represent the behavior of the path by the state transition diagram D as shown in Fig. 10(a). Note



Fig. 7. The nodes in the one-bit full adder.



Fig. 8. Two-dimensional array of one-bit full adders.



Fig. 9. The basic cells and the interface rule.

that in this diagram, the states C, D, S, and T are used instead of (C, 0), (C, 1), (D, 0), and (D, 1), respectively.

Lemma 1: The state transition diagram D in Fig. 10(a) correctly describes the behavior of the signal along any path in the array of Fig. 8.

**Proof:** In Fig. 7, each time a path passes through a node, the signal on that path changes from high (low) to low (high). For example, suppose we have a path  $\alpha$  which exits from the carry output  $C_o$  with a high signal. The path  $\alpha$  must pass through the nodes  $N_{\overline{C_o}}$  and  $N_{C_o}$ . Since a signal changes from low (high) to high (low) when a path passes through a node, the  $\overline{C_o}$  signal is low, and the input on the





Fig. 10. (a) State transition diagram along a critical path. (b) Minimized state transition diagram.

(b)

path into the node  $N_{\overline{C_o}}$  is high. Hence, the state (C, 1) can be reached either from the state (C, 1) or (S, 1).

In the same way we can determine other state transitions. Note that a path to the sum output  $S_o$  results from either  $N_{\overline{C_o}} \rightarrow N_{\overline{S_o}} \rightarrow N_{S_o}$  or  $N_{\overline{S_o}} \rightarrow N_{S_o}$ .

From the above discussion, we can define a finite automaton M that represents the state transitions along a path.

Definition: The finite automaton M is defined as follows:

 $M = (Q, \Sigma, \delta, q_0, F),$ 

where Q is the set of states,  $\Sigma$  is the alphabet,  $\delta$  is the state transition function,  $q_0$  is the initial state, and F is the set of final states.

$$Q = \{q_0, C, D, S, T\}$$
  
$$\Sigma = \{c, d, s_0, s_1, t_0, t_1\}$$

and

$$F = Q$$

where the states C, D, S, and T represent the states (C, 0), (C, 1), (S, 0), and (S, 1), respectively. The symbol c(d) indicates the transition from the node  $N_{\overline{C_0}}$  with a low (high) input to the node  $N_{C_o}$  with a high (low) output, while the symbol  $s_0(t_0)$  indicates the transition from the node  $N_{\overline{C_o}}$  with a high (low) input to the node  $N_{S_o}$  with a low (high) output. And the symbol  $s_1(t_1)$  indicates the transition from the node  $N_{\overline{S_o}}$  with a low (high) input to the node  $N_{S_o}$  with a low (high) output. The transition function  $\delta$  is shown in Fig. 10(a) where  $q_0$  and transitions from  $q_0$ such as  $\delta(q_0, c) = C$ ,  $\delta(q_0, d) = D$ ,  $\delta(q_0, s_0) = S$ , and  $\delta(q_0, t_0) = T$  are not shown.

Since we use the fixed one-bit full adder shown in Fig. 7, there is a fixed delay time associated with each state transition when the circuit parameter  $\pi$  is fixed. Hence, we can define a delay-time function w as follows.

*Definition:* Given a circuit parameter  $\pi$  and a finite automaton  $M = (Q, \Sigma, \delta, q_0, F)$ , defined above, the *delay*time function  $w_{\pi}$  is defined as follows:  $w_{\pi}: Q \times Q \to R$ such that  $w_{\pi}(q_1, q_2)$  is the delay time for calculating the corresponding output signal and propagating this signal when the transition  $(q_1, q_2)$  is in  $\delta$ . When the transition  $(q_1, q_2)$  is not defined in  $\delta$ ,  $w_{\pi}(q_1, q_2)$  is not defined. We use w instead of  $w_{\pi}$  when a circuit parameter  $\pi$  is fixed in the discussion. Since from the definition of the symbols, any transition with the same symbol has the same delay time, we also use the symbol to mean its delay time. Note that  $s_0 > s_1$  and  $t_0 > t_1$ , since the transition from the node  $N_{\overline{S_o}}$  to the node  $N_{S_o}$  is a part of the transition from the node  $N_{\overline{C_a}}$  to the node  $N_{S_a}$  through  $N_{\overline{S_a}}$ .

Let L(M) be the language accepted by the finite automaton M. The language L(M) corresponds to the set of all possible paths in our two-dimensional array. Let  $L_{mn} =$  $\{\alpha \in L(M) \mid |\alpha|_{c} + |\alpha|_{d} = m, |\alpha|_{s_{0}} + |\alpha|_{s_{1}} + |\alpha|_{t_{0}} +$  $|\alpha|_{t_1} = n$  where  $|\alpha|_a$  indicates the number of times the symbol "a" in the string  $\alpha$ . Thus,  $L_{mn}$  corresponds to the set of paths from  $A_{0,0}$  to  $A_{m-n-1,m-1}$ , or in other words,  $L_{mn}$  is the set of paths that have m carry stages and n sum stages.

We use the notation  $C_{FA,m,n}$  for representing the set of circuits  $C \in C_{FA}$  whose critical paths are in  $L_{mn}$ . We define the delay-time function w on a string in L(M) as follows. Let  $\alpha \in L(M)$  and  $\alpha = a_1 a_2 \cdots a_n$  where  $a_i \in \{c, d, s_0, d\}$  $s_1, t_0, t_1$  for  $1 \le i \le n$ . Then  $w(\alpha) \equiv \sum_{i=1}^n w(a_i)$ .

We can now define a *critical path* in our terms.

Definition: We call  $\alpha \in L_{mn} \subseteq L(M)$  a critical path when  $w(\alpha) \ge w(\beta)$  for all  $\beta \in L_{mn}$ . Define CPN as the set of all critical paths of all subcircuits in  $C_{FA}$  and  $CPN_{mn}$ = CPN  $\cap L_{mn}$ . We say that two paths are *equivalent* when their delay times are equal. When  $w(\alpha) < w(\beta) (w(\alpha) > \omega)$  $w(\beta), w(\alpha) = w(\beta)$  for two paths  $\alpha$  and  $\beta$ , we denote that by  $\alpha <_{w} \beta(\alpha >_{w} \beta, \alpha =_{w} \beta$ , respectively).

Our problem of finding a critical path in the  $m \times n$ multiplier then corresponds to the problem of finding a critical path from  $A_{0,0}$  to another cell  $A_{m-1,n-2}$ , as shown in Section VII.

Lemma 2: Every critical path of the  $n \times n$  multiplier has (2n - 3) carry calculation stages and (n - 1) sum calculation stages.

*Proof:* Every time a sum signal appears in a path, the y coordinate increases by 1, while the x coordinate decreases by 1. Every time a carry signal appears in a path, the x coordinate increases by 1, while the y coordinate remains the same. Hence, a path from the cell  $A_{0,0}$  with  $a_s$  sum stages and  $a_c$  carry stages reaches the cell  $A_{a_c-a_s,a_s}$ . In the  $n \times n$  multiplier, a critical path starting from the cell  $A_{0,0}$  finally comes out of the cell  $A_{n-1,n-2}$ with the sum signal. Hence,  $n - 1 = a_c - a_s$  and n - 2 $= a_s$ . Thus,  $a_c = n - 1 + a_s = 2n - 3$ . Hence, we proved that every critical path  $\alpha$  of the  $n \times n$  multiplier has (2n - 3) carry calculation stages, and (n - 2) sum calculation stages, plus another sum calculation in the cell  $A_{n-1,n-2}$ .

#### IX. ANALYSIS OF CRITICAL PATHS

In Section VIII, we saw that the behavior of the signals on the path can be described by the weighted state transition diagram M. In this section, we investigate the problem of finding the critical paths effectively, given the weighted state transition diagram.

Since the longest path between two nodes can be computed given the delay time of all paths between two nodes, we have the following theorem.

*Theorem 1*: Given a fixed parameter  $\pi$  and the delaytime function  $w_{\pi}: Q \times Q \to R$ , we can effectively construct the set CPN.

We next describe a more practical way of constructing a critical path in  $L_{mn}$ . We use  $R(a_1, a_2, \dots, a_k)$  to represent the pair  $(i, a_i)$  where  $a_i = \max(a_1, a_2, \cdots, a_k)$ ; that is, R tells us which argument is maximum.

*Theorem 2:* Given a circuit parameter  $\pi$ , knowledge of

$$R(c, d), R(s_0, t_0), R(s_0 + t_0, s_0 + s_1, t_0 + t_1),$$

and

$$R(s_0 + t_0, 2s_1, 2t_1)$$

is both necessary and sufficient to construct a critical path in  $L_{mn}$  for  $m \ge 0$  and  $n \ge 0$ .

We need the following lemmas to prove theorem 2. The first lemma shows that we can simplify our finite automaton.

Lemma 3: Let L be the set of strings accepted by the finite automaton  $M = (Q, \Sigma, \delta, q_0, F)$  defined in Section VIII and shown in Fig. 10(a). Let  $L_1$  be the set of strings accepted by the finite automaton  $M_1 = (Q_1, \Sigma_1, \delta_1, q_0, F_1)$ shown in Fig. 10(b), where  $Q_1 = \{q_0, A, B\}$ .  $F_1 = \{A, A, B\}$ . B}, and  $\delta_1$  is shown in Fig. 10(b). Then  $L(M) = L(M_1)$ . *Proof:* Use the state minimization algorithm in [24].

From now on, we will concentrate our attention on the reduced state automaton  $M_1$ . Next we will characterize the strings accepted by  $M_1$ .

1296

Lemma 4: Let E be the regular expression of the strings accepted by  $M_1$ . Then

$$E = a_0 a_1^* (t_0 b_1^* s_0)^* a_1^* (\epsilon + t_0)$$
$$+ b_0 b_1^* (s_0 a_1^* t_0)^* b_1^* (\epsilon + s_0)$$

where  $a_i = s_i + c$ ,  $b_i = t_i + d$  for i = 0, 1, and  $\epsilon$  is the empty input.

**Proof:** Let  $r_A(r_B)$  be the regular expressions of strings starting from and ending at state A(B). Then  $r_A = a_1^* (t_0 b_1^* s_0)^* a_1^*$  and  $r_B = b_1^* (s_0 a_1^* t_0)^* b_1^*$ . Let  $E_A(E_B)$  be the regular expressions of strings accepted at state A(B). Then  $E_A = a_0 r_A + b_0 r_B s_0$  and  $E_B = b_0 r_B + a_0 r_A t_0$ . Therefore,  $E = E_A + E_B$  is the desired regular expression.

**Definition:** For two regular expressions  $E_1$  and  $E_2$ , we define  $E_1 \sim E_2$  iff for any  $e_1 \in E_1(e_2 \in E_2)$ , there exists  $e_2 \in E_2(e_1 \in E_1)$  such that  $e_1 = e_2$ , that is,  $T(e_1) = T(e_2)$ . We also use  $e_1 \sim e_2$  when  $e_1 = e_2$ .

Lemma 5:  $E \sim (s_0 t_0 + ct_0 + ds_0 + s_0 + c + d) s_1^* c^*(s_0 + t_0)^* t_1^* d^*$ .

*Proof:* Note that for  $a, b \in \Sigma$ ,  $(a + b)^* \sim a^*b^*$ ,  $ab \sim ba$ . Hence,  $a_1^* = (s_1 + c)^* \sim s_1^*c^*$ ,  $b_1^* = (t_1 + d)^* \sim t_1^*d^*$ ,  $(t_0b_1^*s_0)^* \sim (s_0t_0)^*b_1^*$ , and  $(s_0a_1^*t_0)^* \sim (s_0t_0)^*a_1^*$ . From lemma 4, we can obtain desired result.

Let  $y_3 = \max(s_0 + t_0, s_0 + s_1, t_0 + t_1)$  and  $y_4 = \max(s_0 + t_0, 2s_1, 2t_1)$ . Then define  $z_1, z_2, z_3$ , and  $z_4$  as follows:

$$z_{1} = \max (c, d), \quad z_{2} = \max (s_{0}, t_{0}),$$

$$z_{3} = \begin{cases} s_{0}t_{0} & \text{if } y_{3} = s_{0} + t_{0} \\ s_{0}s_{1} & \text{if } y_{3} = s_{0} + s_{1}, \\ t_{0}t_{1} & \text{if } y_{3} = t_{0} + t_{1} \end{cases}$$

$$z_{4} = \begin{cases} s_{0}t_{0} & \text{if } y_{4} = s_{0} + t_{0} \\ s_{1}^{2} & \text{if } y_{4} = 2s_{1} \\ t_{1}^{2} & \text{if } y_{4} = 2t_{1}. \end{cases}$$

Lemma 6: Let  $\alpha$  be a critical path in  $L_{m,n}$ . Then

$$\alpha \leq_{w} \begin{cases} z_{1}^{m} z_{2} z_{4}^{k} & \text{if } n = 2k + 1 \\ z_{1}^{m} z_{3} z_{4}^{k-1} & \text{if } n = 2k. \end{cases}$$

**Proof:** Let  $\beta$  be a path in  $L_{m,n}$ . From lemma 5,  $\beta \sim c^{x_1} d^{x_2} \gamma$  where  $x_1 + x_2 = m$  and  $\gamma \in (s_0 + t_0 + s_1 + t_1)^n$ . Clearly,  $\beta \leq_w z_1^m \gamma = \max(c, d)^m \gamma$ . Hence, we only have to think about a critical path in  $L_{0,n}$ . Let  $\beta$  be a path in  $L_{0,n}$ . From lemma 5,  $\beta \sim (s_0 t_0 + s_0 + t_0) s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3}$ .

We take the following two cases: 1) n = 2k + 1, and 2) n = 2k.

1) n = 2k + 1:  $\beta \sim \gamma s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3}$  where  $\gamma \in (s_0 t_0 + s_0 + t_0)$ .

a) If  $\gamma = s_0 t_0$ , then either  $x_1$  or  $x_2$  is odd. Without loss of generality, we assume that  $x_1 = 2k_1 + 1$  and  $x_2 =$ 

 $2k_2$ . Since  $s_1 <_w s_0$ ,  $s_1^2 \le_w z_4$ ,  $t_1^2 \le_w z_4$ ,  $s_0 t_0 \le_w z_4$ , and  $s_0 \le_w z_2$ , we have

$$\beta \sim s_0 t_0 s_1 s_1^{2k_1} t_1^{2k_2} (s_0 t_0)^{x_3} <_w z_4 s_0 z_4^{k_1} z_4^{k_2} z_4^{x_3} \le_w z_2 z_4^{k_4}.$$

b) If  $\gamma = s_0$  or  $t_0$ , then  $\beta \sim \gamma s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3}$  and  $x_1 \equiv x_2 \pmod{2}$ . Note that  $\gamma \leq_w z_2$ . If  $x_1 = 2k_1 + 1$  and  $x_2 = 2k_2 + 1$ , then

$$\beta \sim \gamma(s_1t_1)s_1^{2k_1}t_1^{2k_2}(s_0t_0)^{x_3} \leq_w z_2(s_0t_0)z_4^{k_1}z_4^{k_2}z_4^{x_3} \leq_w z_2z_4^k.$$

If  $x_1 = 2k_1$  and  $x_2 = 2k_2$ , then

$$\beta \sim \gamma s_1^{2k_1} t_1^{2k_2} (s_0 t_0)^{x_3} \leq_w z_2 z_4^k$$

Hence,  $\beta \leq z_2 z_4^k$ .

2) n = 2k:  $\beta \sim \gamma s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3}$  where  $\gamma \in (s_0 t_0 + s_0 s_1 + t_0 t_1 + s_0 t_1 + s_1 t_0)$ . Since  $n = 2 + x_1 + x_2 + 2x_3$  is even, we know  $x_1 \equiv x_2 \pmod{2}$ . Since  $s_0 t_1 \leq w s_0 t_0$  and  $s_1 t_0 \leq w s_0 t_0$ , we have  $\beta \leq w z_3 s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3}$ . In the same way as in b) above,  $s_1^{x_1} t_1^{x_2} (s_0 t_0)^{x_3} \leq w z_4^{k-1}$ . Thus,  $\beta \leq w z_3 z_4^{k-1}$ .

Definition: A string  $\beta \in \Sigma^*$  is said to be constructible when there exists a path  $\alpha \in L(M_1)$  such that  $\alpha \sim \beta$ .

Lemma 7: For any integers  $m \ge 0$  and  $k \ge 1$ , the strings  $z_1^m z_2 z_4^k$  and  $z_1^m z_3 z_4^{k-1}$  are constructible, and therefore, the upper bounds in lemma 6 are attained.

**Proof:** We will prove this for n = 2k + 1. The proof for *n* even is similar. Without loss of generality, we assume that  $z_2 = s_0$ . Now we consider the constructibility of  $z_2 z_4^k$ . We will find a string  $\alpha_{0,n} \in L_{0,n}$  such that  $\alpha_{0,n} \sim z_2 z_4^k$ . We have the following three cases for  $z_4$ .

1)  $z_4 = s_0 t_0$ : Let  $\alpha_{0,n} = s_0 (t_0 s_0)^k \in L_{0,n}$ . Then  $\alpha_{0,n} \in L(M_1)$  and  $\alpha_{0,n} \sim z_2 z_4^k$ .

2)  $z_4 = s_1^2$ : Let  $\alpha_{0,n} = s_0 s_1^{2k} \in L_{0,n}$ . Then  $\alpha_{0,n} \in L(M_1)$ and  $\alpha_{0,n} \sim z_2 z_4^k$ .

3)  $z_4 = t_1^2$ : Since  $s_0 + t_0 \le 2t_1 < t_0 + t_1$ , we have  $s_0 < t_1 < t_0$ . However, we assumed that  $z_2 = \max(s_0, t_0) = s_0$ . Thus, this case does not happen.

From 1), 2), and 3) we see that the string  $z_2 z_4^k$  is constructible.

Now we prove that  $z_1^m z_2 z_4^k$  is constructible. Let  $y_1 = s_0 c^m$  if  $z_1 = c$ , or let  $y_1 = d^m s_0$  if  $z_1 = d$ . Let  $\alpha_{0,n} = s_0 y_2$ . Let  $\alpha = y_1 y_2$ . Here we know that  $\alpha \in L_{m,n}$  and  $\alpha \sim z_1^m z_2 z_4^k$ . Thus, we proved that  $z_1^m z_2 z_4^k$  is constructible.  $\Box$ 

The following example shows how to construct a critical path  $\beta$  in  $L_{10,5}$  from knowledge of R functions, given a fixed parameter  $\pi = (12, 16, 12, 8, 8)$ . By computation we obtain a critical path  $\alpha = cct_0 s_0$  in  $L_{2,2}$ ,  $w_c = 10.3$ ns,  $w_{t_0} = 18.7$  ns, and  $w_{s_0} = 15.2$  ns. Since  $w(cct_0 s_0) \ge 15.2$  $w(dds_0t_0)$ , we have  $\max(c, d) = c$ . Since  $cct_0s_0 \ge w cct_0t_1$ , we have  $s_0 + t_0 \ge t_0 + t_1 > 2t_1$ . Since  $\max(s_0, t_0) = t_0$ , we know  $s_0 + t_0 \ge 2s_0 > 2s_1$  and  $t_0 \ge s_0 > s_1$ . Thus,  $\max(s_0 + t_0, 2s_1, 2t_1) = s_0 + t_0$  and  $\max(s_0 + t_0, s_0 + t_0) = s_0 + t_0$  $s_1$ ,  $t_0 + t_1$  =  $s_0 + t_0$ . Then from lemma 7, we can construct a critical path  $\beta = c^{10} (t_0 s_0)^2 t_0 \in L_{10,5}$  and calculate the delay time  $w(\beta) = 189.5$  ns. In fact, actual computation shows that a critical path of  $L_{10,5}$  is  $\gamma =$  $c'(t_0s_0)c^2(t_0s_0)ct_0$  and its delay time  $w(\gamma)$  is 189.3 ns. Note that the critical paths  $\gamma$  and  $\beta$  are equivalent in the sense that  $w(\beta) = w(\gamma) = 10c + 3t_0 + 2s_0$ .

Lemma 8: Given a critical path  $\alpha$  in  $L_{1,3}$ , we know the values of the R functions in theorem 2.

**Proof:** Let  $\alpha$  be a critical path in  $L_{1,3}$ .  $\alpha \sim \beta \gamma$  where  $\beta \in (c + d)$  and  $\gamma \in (s_0 + s_1 + t_0 + t_1)^3$ . Clearly,  $\beta = \max(c, d) = z_1$ . Since  $s_0 s_1 t_0 <_w s_0 t_0 s_0$  and  $s_0 t_0 t_1 <_w t_0 s_0 t_0$ , we have  $\gamma \sim (s_0 s_1^2 + s_0 t_0 s_0 + t_0 s_0 t_0 + t_0 t_1^2)$ . Suppose  $\gamma \sim s_0 s_1^2$ . Then we can find the values of the R functions in theorem 2 as follows. Since  $s_0 s_1 s_1 \ge_w s_0 t_0 s_0$ , we have  $s_0 + t_0 \le 2s_1 < s_1 + s_0$ . Thus,  $t_0 < s_1 < s_0$ . Hence,  $\max(s_0, t_0) = s_0$ ,  $\max(s_0 + t_0, s_0 + s_1, t_0 + t_1) = s_0 + s_1$  and  $\max(s_0 + t_0, 2s_1, 2t_1) = 2s_1$ . We can also compute the R functions in the other cases as above.  $\Box$ 

Now we prove theorem 2.

Proof of Theorem 2: Suppose we know the values of the R functions in theorem 2. From lemma 7, there is a path  $\beta$  in  $L_{m,n}$  such that

$$\beta \sim \begin{cases} z_1^m z_2 z_4^k & \text{if } n = 2k + 1 \\ z_1^m z_3 z_4^{k-1} & \text{if } n = 2k. \end{cases}$$

From lemma 6, the delay time of the path  $\beta$  is worse than any path in  $L_{m,n}$ . This means that the path  $\beta$  itself is a critical path. Conversely, we can find the desired *R* functions from a critical path in  $L_{1,3}$  from lemma 8.

From theorem 2 and lemma 8, we have found a new way to implement our critical-path optimization method, as shown in Fig. 11. By analyzing the small configuration (the circuit in  $C_{FA, 1, 3}$ ), we can avoid analyzing the entire circuit. This means that our optimization workload will be reduced significantly when the circuit is large.

#### X. A CANONICAL CONFIGURATION FOR THE $n \times n$ Multiplier

Although we found an effective way of computing a critical path of  $L_{mn}$  and its delay time, we still have a ma-



Fig. 11. New implementation of the critical-path optimization method.

Let  $T_{\pi}(n)$  denote the delay time of the critical path of the  $n \times n$  multiplier given a circuit parameter  $\pi$ . We use T(n) instead of  $T_{\pi}(n)$  when  $\pi$  is fixed in the discussion. We can now give an explicit formula  $T_{\pi}(n)$  for the  $n \times n$ multiplier.

Theorem 3: Given a circuit parameter  $\pi$ , then  $T_{\pi}(n)$  can be represented as follows:

$$T_{\pi}(n) = kx_1 + x_2, \tag{1}$$

where  $k = \lfloor n/2 \rfloor - 1$ ,

$$x_{1} = 4 \max(c, d) + \max(t_{0} + s_{0}, 2s_{1}, 2t_{1})$$

$$x_{2} = \begin{cases} \max(c, d) + \max(t_{0}, s_{0}) & \text{if } n = 2k \\ 3 \max(c, d) + \max(t_{0} + s_{0}, s_{0} + s_{1}, t_{0} + t_{1}) & \text{if } n = 2k + 1. \end{cases}$$

jor question left. That is, does there exist an effective way to find a locally optimal parameter for the large  $n \times n$ multiplier? In this section we prove that the answer to the question is yes and, furthermore, we will show a stronger result by introducing the idea of the *canonical configuration*.

Definition: A circuit  $CC \in C_{FA}$  is called the *canonical* configuration of the  $n \times n$  multiplier iff the optimization of CC yields the same parameters as the optimization of the  $n \times n$  multiplier, for all sufficiently large n.

We now consider the following problem.

**Problem 1:** Is there a canonical configuration CC of the  $n \times n$  multiplier? If a CC exists, what is it? How can it be found?

**Proof:** Let 
$$z_1$$
,  $z_2$ ,  $z_3$ , and  $z_4$  be as defined in Section IX. From lemma 2, a critical path  $\alpha$  of the  $n \times n$  multiplier is in  $L_{2n-3,n-1}$ . And from lemma 6, we know that

$$\alpha \sim \begin{cases} z_1^{2n-3} z_2 z_4^k & \text{if } n-1 = 2k+1 \\ z_1^{2n-3} z_3 z_4^{k-1} & \text{if } n-1 = 2k \end{cases} \quad \text{or} \quad \begin{array}{l} n = 2k+2 \\ n = 2k+1 \end{cases}$$

$$\alpha \sim \begin{cases} z_1^{4k-3} z_2 z_4^{k-1} & \text{if } n = 2k \\ z_1^{4k-1} z_3 z_4^{k-1} & \text{if } n = 2k+1 \end{cases}$$
where the set of the se

thus,

$$\alpha \sim \begin{cases} (z_1^4 z_4)^{k-1} (z_1 z_2) & \text{if } n = 2k \\ (z_1^4 z_4)^{k-1} (z_1^3 z_3) & \text{if } n = 2k + 1. \end{cases}$$
  
Note that  $k = \lfloor n/2 \rfloor - 1$ .

IWANO AND STEIGLITZ: OPTIMIZATION OF ONE-BIT FULL ADDERS

Corollary 1: Given a circuit parameter  $\pi$ , then  $T_{\pi}(n)$  is asymptotically proportional to n.

П

*Proof:* This is clear from (1).

In fact, from Table IV in Section V, we can obtain empirically the formula  $T_{\pi_1} = 40(n-1)$  when  $\pi_1 = (4, 16, 8, 8)$ .

Corollary 2: The circuit  $C_1 \in C_{FA,4,2}$  shown in Fig. 12 is a canonical configuration of the  $n \times n$  multiplier. The critical path of  $C_1$  has four carry stages and two sum stages.

**Proof:** From theorem 3, we have  $T(n) = \alpha_{4,2} \lfloor n/2 \rfloor + \beta$ , where  $\alpha_{4,2}$  is the delay time of four carry stages and  $\beta$  is the constant delay time. Hence, an optimal parameter of  $L_{4,2}$  is, asymptotically for large *n*, also an optimal parameter  $L_{mn}$ .

From lemma 2 in Section VII, there are (2n - 3) carry stages and (n - 1) sum stages in any critical path of the  $n \times n$  multiplier. Thus, we might expect that the circuit  $C_0 \in C_{FA,2,1}$ , whose critical path has two carry stages and one sum stage, is a canonical configuration of the  $n \times n$ multiplier. However, as we saw in this section, the circuit  $C_1 \in C_{FA,4,2}$  is a canonical configuration, but the circuit  $C_0 \in C_{FA,2,1}$  is not a canonical configuration. The reason is that optimization of the circuit  $C_0$  cannot determine max $(s_0 + t_0, 2s_1, 2t_1)$ .

## XI. THE OPTIMIZATION OF A SUBCIRCUIT

In this section, we optimize the two circuits  $C_0$  and  $C_1$  discussed in the previous section, verifying that the circuit  $C_1$  works well as a canonical configuration, but the circuit  $C_0$  does not. The circuit  $C_1$  is indicated by a solid line in Fig. 12, while the circuit  $C_0$  is indicated by a dotted line. The critical path from cell  $A_{0,0}$  to cell  $A_{2,2}$  is analyzed for  $C_1$  in Fig. 12, while the critical path from cell  $A_{0,0}$  to cell  $A_{1,1}$  is analyzed for  $C_0$ .

Table VI shows which parameters are locally optimal for m33, m44, m88, m1010,  $C_0$ , and  $C_1$ . The symbol  $\times$ indicates a local optimum. The same set of 11 random initial parameters were used for each circuit, and only these 5 distinct local optima were obtained.

The most important result is that every local optimum of m88 and m1010 is also a local optimum of  $C_1$ . This is not true for  $C_0$ , nor is it true for the smaller circuits m33 and m44. Thus, we can say that the circuit  $C_1$  is indeed appropriate as a representative subcircuit of the  $n \times n$ multiplier. In this way, corollary 2 in Section IX is confirmed very well by numerical experiments.

#### XII. CONCLUSIONS

We have described a general approach for sizing the transistors in a cell that is embedded in a regular array, using local search along the critical path. The simplest, most regular array multiplier structure was used as an example, with delay time (not throughput) as a criterion. No attempt was made to incorporate intermediate clocking, precharging, or superbuffers.



Fig. 12. The circuits  $C_0$  and  $C_1$ .

TABLE VI LOCALLY OPTIMAL PARAMETERS FOR EACH CIRCUIT

|                | $\pi_1$ | $\pi_2$ | $\pi_3$ | $\pi_4$ | $\pi_5$ |
|----------------|---------|---------|---------|---------|---------|
| C <sub>0</sub> | ×       |         |         |         | ×       |
| $C_1$          | ×       | ×       |         | ×       |         |
| m1010          | ×       | , ×     |         | ×       |         |
| m88            | ×       | ×       |         | ×       |         |
| m44            | ×       | ×       | ×       |         |         |
| m33            | ×       | ×       | ×       |         |         |

We quickly encountered the problem that the running time of the optimization increases rapidly when we increase the size of the multiplier. We therefore tried to make our optimization method more practical by making use of the circuit's regularity and developed what we call the *canonical configuration method*. In this method we locally optimize the small circuit  $C_1$  instead of the entire circuit. We showed how to extract this canonical configuration  $C_1$  for the  $n \times n$  multiplier, and gave experimental results that illustrate the savings offered by this method.

The following problems are the subject of future investigation. Do canonical configurations exist in more general regular arrays? If the canonical configuration exists for some regular array, how can we extract it?

#### References

- P. R. Cappello and K. Steiglitz, "Digital signal processing applications of systolic algorithms," in *CMU Conference on VLSI Systems* and Computations, H. T. Kung, B. Sproull, and G. Steel, Eds. Rockville, MD: Computer Science Press, 1981.
- [2] —, "Completely pipelined architectures for digital signal processing," *IEEE Trans. Acoust.*, Speech, Signal Processing, vol. ASSP-31, pp. 1016-1022, Aug. 1983.
- [3] —, "A note on 'free accumulation' in VLSI filter architectures," IEEE Trans. Circuits Syst., in press.
- [4] C. Caraiscos and B. Liu, "Bit serial VLSI implementations of FIR and IIR digital filters," in Proc. IEEE Int. Symp. Circuits Syst., May 1983.
- [5] P. B. Denyer and D. J. Myers, "Carry-save arrays for VLSI signal processing," in VLSI 81: Very Large Scale Integration, J. P. Gray, Ed. London, England: Academic, 1983. (Also, in Proc. Conf. Very Large Scale Integration, Univ. Edinburgh, U.K., Aug. 18-21, 1981.)
- [6] L. A. Glasser and L. P. J. Hoyte, "Delay and power optimization in VLSI circuits," in Proc. IEEE 21st Design Automat. Conf., 1984, pp. 529-535.
- [7] K. Iwano and K. Steiglitz, "Some experiments in VLSI leaf-cell optimization," presented at the 1984 IEEE Workshop on VLSI Signal Processing, Univ. Southern Calif., Nov. 12-14, 1984, pp. 387-395.
- [8] ----, "Time-power-area tradeoffs for the nMOS VLSI full-adder," in

1985 Proc. Int. Conf. Acoust., Speech, Signal Processing, Tampa FL, Mar. 1985. pp. 1453-1456.

- [9] H. T. Kung, L. M. Ruane, and D. W. L. Yen, "A two-level pipelined systolic array for convolutions," *CMU Conference on VLSI Systems and Computations*, H. T. Kung, B. Sproull, and G. Steele, Eds. Rockville, MD: Computer Science Press, 1981.
- [10] E. Lawler, *Combinatorial Optimization*. New York: Holt, Reinhart, and Winston, 1976.
- [11] R. J. Lipton, S. C. North, R. Sedgewick, J. Valdes, and G. Vijayan, "VLSI layout as programming," ACM Trans. Program. Languages Syst., July 1983.
- [12] R. F. Lyon, "A bit-serial VLSI architecture methodology for signal processing," in VLSI 81: Very Large Scale Integration, J. P. Gray, Ed. London, England: Academic, 1981. (Also, in Proc. Int. Conf, Very Large Scale Integration, Univ. Edinburgh, U.K., Aug. 18-21, 1981.)
- [13] J. Mata, "ALLENDE: A procedural language for the hierarchical specification of VLSI layouts," in Proc. IEEE 22nd Design Automat. Conf., 1985.
- [14] M. D. Matson, "Macromodeling and optimization of digital MOS VLSI circuits," Ph.D. dissertation, Dept. EECS, M.I.T., Cambridge, MA, Jan. 1985.
- [15] R. N. Mayo, J. K. Ousterhout, and W. S. Scott, "1983 VLSI tools," Comput. Sci. Div. (EECS), Univ. Calif., Berkeley, CA, Rep. UCB/ CSD 83/115, Mar. 1983.
- [16] J. V. McCanny, J. G. McWhirter, J. B. G. Roberts, D. J. Day, and T. L. Thorp, "Bit level systolic arrays," in *Proc. 15th Asilomar Conf. Circuits, Syst., Comput.*, Nov. 1981.
- [17] S. Trimberger, "Automated performance optimization of custom integrated circuits," in VLSI 83: VLSI Design of Digital Systems, F. Anceau and E. J. Aas, Eds. Amsterdam, The Netherlands: North-Holland, 1983. (Also, in Proc. IFIP Int. Conf. Very Large Scale Integration, Trondheim, Norway, Aug. 1983.)
- [18] A. J. Strojwas, S. R. Nassif, and S. W. Director, "Optimal design of VLSI minicells using a statistical process simulator," in *Proc. IEEE Int. Conf. Circuit Syst.*, 1983, pp. 202-205.
- [19] J. K. Ousterhout, "Switch-level delay models for digital MOS VLSI," in Proc. IEEE 21st Design Automat. Conf., 1984, pp. 542-547.
- [20] D. J. Myers, "Multipliers for LSI and VLSI signal processing applications," Masters thesis, Edinburgh Univ., Edinburgh, England, Sept. 1981.
- [21] J. Allen, "VLSI architectures for signal processing," in VLSI Architecture, B. Randell and P. C. Treleaven, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1983, pp. 242-254.
- [22] C. Mead and L. Conway, Introduction to VLSI Systems. Reading, MA: Addison-Wesley, 1980.
- [23] N. Jouppi, "Timing analysis for nMOS VLSI," in Proc. IEEE 20th Design Automat. Conf., June 1983, pp. 411-418.

[24] J. E. Hopcroft and J. P. Ullman, Introduction to Automata Theory, Languages and Computation. Reading, MA: Addison-Wesley, 1979.

[25] C. W. Wu, P. R. Cappello, and M. Saboff, "An FIR filter tissue," presented at the 1985 Proc. 19th Asilomar Conference on Circuit, Systems, and Computers, Pacific Grove, CA, Nov. 1985.



Kazuo Iwano received the B.S. degree in mathematics from the University of Tokyo, Tokyo, Japan, in 1975, and the M.S.E. and the M.A. degrees in electrical engineering and computer science from Princeton University, Princeton, NJ, in 1983 and 1984, respectively.

He is currently a Ph.D. student at Princeton University in the Department of Computer Science. When not in graduate school, he has been working for IBM-Japan, Ltd. since 1975. His current interests include the VLSI design, graph theory, and the design and analysis of algorithms.



Kenneth Steiglitz (S'57–M'64–SM'79–F'81) was born in Weehawken, NJ, on January 30, 1939. He received the B.E.E., M.E.E., and Eng.Sc.D. degrees from New York University, New York, NY, in 1959, 1960, and 1963, respectively.

Since September 1963 he has been at Princeton University, Princeton, NJ, where he is now a Professor of Computer Science, teaching and conducting research on VLSI design and implementation of signal processing, optimization algorithms, and the foundations of computing. He is

the author of Introduction to Discrete Systems (New York: Wiley, 1974), and co-author, with C. H. Papadimitriou, of Combinatorial Optimization: Algorithms and Complexity (Englewood Cliffs, NJ: Prentice-Hall, 1982).

Dr. Steiglitz is a member of the VLSI Committee of the IEEE ASSP Society, is serving his second term as member of the Administrative Committee, and has also served on the Digital Signal Processing Committee, and as Awards Chairman of that Society. He is an Associate Editor of the journal Networks, and is a former Associate Editor of the Journal of the Association for Computing Machinery. He is a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi, and he received the Technical Achievement Award of the ASSP Society in 1981, and the IEEE Centennial Medal in 1984.