#### Bit-Systolic Arithmetic Arrays Using Dynamic Differential Gallium Arsenide Circuits

| Grant Beagles            | Kel Winters                     | A. G. Eldin          |
|--------------------------|---------------------------------|----------------------|
| Texas Instruments        | Advanced Hardware Architectures | University of Toledo |
| Lewisville TX 75067      | Moscow ID 83843                 | Toledo OH 43606      |

Abstract - A new family of gallium arsenide circuits for fine-grained bit-systolic arithmetic arrays is introduced. This scheme combines features of two recent techniques: dynamic gallium arsenide FET logic and differential dynamic single-clock CMOS logic. The resulting circuits are fast and compact, with tightly constrained series FET propagation paths, low fanout, no DC power dissipation, and depletion FET implementation without level-shifting diodes.

## 1 Introduction

The advantages of parallel arrays of serial arithmetic processing modules have been recognized since the 1950s. These arrays were initially seen as a way to speed up signal and information processing. With the advances in VLSI technology, it is now possible to easily realize these highly modular architectures.

A number of bit-systolic arithmetic arrays have been developed with the intent of maximizing the clock rate for a given CMOS process, including a 2-D convolver for image processing, an integer polynomial solver, and a finite-field polynomial solver. These arrays were modeled using SPICE in a 2-micron (minimum feature size) CMOS process offered through MOSIS. Even with this relatively inexpensive, low-performance process and worst-case models, the cells were modeled operating reliably at well over 100 MHz [9,13].

A new dynamic gallium arsenide logic circuit family was proposed in 1991 [1]. These circuits, resembling GaAs DRAM circuits, have several advantages over previously used dynamic GaAs logic circuits. Prior dynamic GaAs circuits use both depletion-mode and enhancement-mode devices on the same die, and typical process yield is about 50%. A major advantage of the new family of dynamic circuits used for this project is that only depletion-mode devices are required to realize any function. Using only one type of device should give significantly higher process yields.

A second advantage is the absence of any DC power supply. This characteristic completely eliminates the DC power dissipation problems that have severely limited the use of dynamic GaAs logic circuits. This characteristic is not without its downside, however, since the clock driver must be hefty enough to rapidly charge the capacitors used in the circuits and clock skew is critical.

Another characteristic which sets these circuits apart from other dynamic GaAs circuits is their regularity and lack of level shifting diodes, a messy requirement of previous dynamic GaAs families. These circuits may be very compact and possess the traits desirable in systolic architectures.

The systolic arithmetic cells described earlier were redesigned using this new dynamic GaAs logic. The new circuits were modeled using P-SPICE and process parameters characteristic of the Vitesse depletion-mode GaAs FET. The models show an order-of-magnitude improvement in clock speed when compared to the CMOS cells described by Winters et al.

## 2 Dynamic GaAs Circuits

The complexity of GaAs FET VLSI circuits is limited by the maximum power dissipation while the uniformity of the device parameters determines the functional yield. For a given process yield, the functional yield can vary significantly. The variation is due, in part, to the use of different circuit structures that may be operated in either dynamic or static modes. Also, the sensitivity of the proper functionality to variations in the process parameters is highly dependent on the selection of the circuit structure and the mode of operation.

Static GaAs circuits are ratioed circuits. This is one of the reasons why their functionality and speed of operation are strongly dependent on the variations in the device threshold voltage. The fundamental requirement for dynamic circuits is that the ratio of the device current in its ON state to that in the OFF state should be sufficiently large (several orders of magnitude). This basic requirement for dynamic logic circuits can be met using GaAs circuits.
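A first-order estimate shows why this current ratio matters. The symbols below are assumptions introduced for illustration and do not appear in the original: a storage node of capacitance $C$, a logic swing $\Delta V$, a tolerable droop $\delta V$, and device currents $I_{ON}$ and $I_{OFF}$ treated as constants.

$$
t_{charge} \approx \frac{C\,\Delta V}{I_{ON}}, \qquad
t_{hold} \approx \frac{C\,\delta V}{I_{OFF}}, \qquad
\frac{t_{hold}}{t_{charge}} \approx \frac{\delta V}{\Delta V}\cdot\frac{I_{ON}}{I_{OFF}}
$$

Under these assumptions, an ON/OFF ratio of several orders of magnitude lets a node be written within a fraction of a clock phase while holding its value for much longer than a phase.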

Dynamic circuits are ratioless, which makes their functionality completely insensitive to threshold voltage variations. The dynamic GaAs circuit's speed is also significantly less affected by these variations than in static logic designs.

The family of dynamic GaAs circuits used for the designs in this project does not dissipate DC power. Dynamic circuits also have smaller device counts than static circuits of similar functionality. These features are very attractive for the implementation of ultra-high-speed VLSI architectures, as demonstrated in this paper.

The new GaAs dynamic circuits [1] used to implement the systolic arithmetic arrays extend the ideas employed in the JCMOS DRAM cell and the related work on BiCMOS dynamic memory and logic structures [10]-[12].

The operation of the new circuits can be easily explained using the diagrams in Figure 1. An intermediate stage of a dynamic shift register and its idealized functionality are illustrated.



Figure 1: The D-type flip-flop uses depletion mode transistors having a threshold voltage of -0.7 volts.

During $T_1$, $\Phi_2 = 0$ volts. The master section is in the sample phase. The input data is stored on $C_1$. $C_1$ is charged to 2 volts or discharged to 0 volts for a logical 1 or 0, respectively. $C_2$ is precharged to approximately 3.5 volts through $J_2$ ($\Phi_1 = -3.5$ volts).

The slave section is in the evaluation phase. Transistor  $J_3$  is in cut off and the drain voltage of  $J_2$ ,  $V_{D2} = 0$  volts, thus providing a reference voltage for evaluating the data stored on  $C_3$ . If  $C_3$  is charged,  $J_4$  is turned off and the precharged capacitor  $C_4$  retains its voltage (logic 1). However, if  $C_3$  is discharged,  $J_4$  is turned on and  $C_4$  is discharged to represent a logic 0. During  $T_2$ , the roles of the master and slave sections are interchanged.
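The two-phase behavior described above can be sketched as a purely logical model. This is an illustration under stated assumptions, not a circuit simulation: capacitor states are reduced to 1 (charged) or 0 (discharged), the master is assumed to evaluate in the same way the text describes for the slave, and the names `DynamicStage` and `shift` are hypothetical.

```python
class DynamicStage:
    """Logical model of one master/slave stage (assumed symmetric halves)."""

    def __init__(self):
        self.c1 = 0  # master storage capacitor
        self.c2 = 1  # master output capacitor (precharged)
        self.c3 = 0  # slave storage capacitor
        self.c4 = 1  # slave output capacitor (precharged)

    def t1(self, d):
        """Phase T1: master samples and precharges, slave evaluates."""
        self.c1 = d          # input data stored on C1
        self.c2 = 1          # C2 precharged
        if self.c3 == 0:     # discharged C3 turns J4 on, discharging C4
            self.c4 = 0
        return self.c4       # stage output, valid at the end of T1

    def t2(self):
        """Phase T2: roles interchanged (master evaluates, slave samples)."""
        if self.c1 == 0:     # assumed mirror of the slave evaluation
            self.c2 = 0
        self.c3 = self.c2    # slave samples the master output
        self.c4 = 1          # C4 precharged for the next evaluation


def shift(bits, n_stages):
    """Clock a bit sequence through a chain of stages, one bit per cycle."""
    chain = [DynamicStage() for _ in range(n_stages)]
    outputs = []
    for d in bits:
        value = d
        for stage in chain:      # every stage sees T1 at the same time;
            value = stage.t1(value)  # each returns its previously stored bit
        outputs.append(value)
        for stage in chain:
            stage.t2()
    return outputs
```

In this model each stage delays its input by one full clock cycle, so a chain of k stages behaves as a k-bit shift register whose output is 0 until the first input bit arrives.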

Simulations of this circuit, using a device model that accounts for second-order side effects and is accurately calibrated to a 1 micron HFET process, verify the operation of the D-type flip-flop at 2 GHz [1]. A comparison with DCFL implementations shows that the dynamic circuit requires 30% less area and dissipates only 10% of the power (dynamic only) of the DCFL flip-flop.

The basic dynamic circuit can also be used to implement the AND, OR and complex logic functions. The operation of the basic circuit is similar to that of the dynamic flip flop.

Fig. 2 summarizes the operation. When the input is logic 0, the capacitor C is discharged during the sampling phase and  $J_2$  will turn on during the evaluation phase. Similarly, if the input is a logical 1, the capacitor is charged during the sampling phase causing  $J_2$  to turn off during the evaluation phase.



| Input | C          | J<sub>2</sub> |
|---|------------|----------------|
| 0 | DISCHARGED | ON             |
| 1 | CHARGED    | OFF            |

Figure 2: Basic dynamic logic circuit

Figure 3 shows a complex dynamic logic gate.

Bit-systolic systems are defined here as synchronous digital systems whose combinatorial timing paths involve the computation of no more than one bit of data. Moreover, these systems are constrained to be locally connected; that is, modules at each level of hierarchy are connected exclusively to their nearest neighbors in the physical artwork. Bit-systolic systems are therefore very fine-grained pipelined architectures with tightly restricted fanout and interconnect capacitance. Such systems may also be said to be bit-extensible, implying that a computational word width may be extended to an arbitrary number of bits without affecting the clock speed (excluding clock and control signal loading).

Figure 3: The dynamic complex logic gate

For example, Figure 4 illustrates a bit-systolic serial-parallel multiply-accumulator introduced in 1990 as a building block for 2-D image convolver and single-instruction multiple data path (SIMD) processor arrays [13]. Here, the multiplier, x, is pipelined through n stages, and the multiplicand, y, is input in parallel form. The maximum fanout in the multiplier pipeline is 2 gate inputs. Summing is performed in an accumulator pipeline consisting of full-adder modules and pipeline delays aligning the product-sum to the multiplier. This systolic multiplier requires 2n clock cycles to multiply two n-bit unsigned integers. The least significant bit of the product, x0y0, appears at the output n clock cycles into the multiplication sequence. The module may be easily extended to accommodate signed inputs. Operation of the bit-systolic multiplier is illustrated in Table 1.



Figure 4: Bit-systolic serial-parallel multiplier

The useful property of this configuration is that it contains n bits of storage for the multiplier and 2n bits for the product. An array of these modules would have the proper ratio of operand versus product storage. For instance, a product sum could be accumulated by a single multiplier whose output is fed back to its addend input in 2n clock cycles per multiplication.

| Clock cycle | y3   | y2        | y1             | y0                  |
|-------------|------|-----------|----------------|---------------------|
|             | x<3> | x<2>      | x<1>           | x<0>                |
|             | a<3> | a<2>      | a<1>           | a<0>                |
| 1           | x0   |           |                |                     |
|             | x0y3 |           |                |                     |
| 2           | x1   | x0        |                |                     |
|             | x1y3 | x0y2      |                |                     |
| 3           | x2   | x1        | x0             |                     |
|             | x2y3 | x0y3+x1y2 | x0y1           |                     |
| 4           | x3   | x2        | x1             | x0                  |
|             | x3y3 | x1y3+x2y2 | x0y2+x1y1      | x0y0                |
| 5           | 0    | x3        | x2             | x1                  |
|             |      | x2y3+x3y2 | x0y3+x1y2+x2y1 | x0y1+x1y0           |
| 6           | 0    | 0         | x3             | x2                  |
|             |      | x3y3      | x1y3+x2y2+x3y1 | x0y2+x1y1+x2y0      |
| 7           | 0    | 0         | 0              | x3                  |
|             |      |           | x2y3+x3y2      | x0y3+x1y2+x2y1+x3y0 |
| 8           | 0    | 0         | 0              | 0                   |
|             |      |           | x3y3           | x1y3+x2y2+x3y1      |
| 9           | x0   | 0         | 0              | 0                   |
|             | x0y3 |           |                | x2y3+x3y2           |
| 10          | x1   | x0        | 0              | 0                   |
|             | x1y3 | x0y2      |                | x3y3                |
| 11          | x2   | x1        | x0             | 0                   |
|             | x2y3 | x0y3+x1y2 | x0y1           |                     |

Table 1: Bit-systolic multiplier operation

The product pipeline can accumulate external addends with its partial products without additional adder logic. The external addend is shifted serially into the a input and must be pre-shifted n bits into the product pipeline before the LSB of the multiplier is entered. Thus, the lower n bits of the addend occupy the high n bits of the product pipeline at the beginning of a multiply sequence. Then, n multiplier bits are shifted into the multiplier pipeline, leaving the LSB of the accumulated product in the LSB of the product pipeline. During the next n clock cycles, the multiplier is shifted out (replaced by zeroes), the low n bits of the product are shifted out of the product pipeline, and the high n bits of the product are left in the low half of the product pipeline, while the low n bits of the next addend are pre-shifted into the high half. This is illustrated in Table 2 for a four-bit systolic serial-parallel multiplier. In both tables, the first row of each clock cycle (shaded in the original) shows the multiplier pipeline, and the second row shows the accumulator pipeline, where only the contents of the first flip-flop in each bit-cell are shown.

| Clock cycle | y3      | y2           | y1                | y0                     |
|-------------|---------|--------------|-------------------|------------------------|
|             | x<3>    | x<2>         | x<1>              | x<0>                   |
|             | a<3>    | a<2>         | a<1>              | a<0>                   |
| 1           | x0      |              |                   |                        |
|             | a3+x0y3 | a1           |                   |                        |
| 2           | x1      | x0           |                   |                        |
|             | a4+x1y3 | a2+x0y2      | a0                |                        |
| 3           | x2      | x1           | x0                |                        |
|             | a5+x2y3 | a3+x0y3+x1y2 | a1+x0y1           |                        |
| 4           | x3      | x2           | x1                | x0                     |
|             | a6+x3y3 | a4+x1y3+x2y2 | a2+x0y2+x1y1      | a0+x0y0                |
| 5           | 0       | x3           | x2                | x1                     |
|             | a7      | a5+x2y3+x3y2 | a3+x0y3+x1y2+x2y1 | a1+x0y1+x1y0           |
| 6           | 0       | 0            | x3                | x2                     |
|             | a0      | a6+x3y3      | a4+x1y3+x2y2+x3y1 | a2+x0y2+x1y1+x2y0      |
| 7           | 0       | 0            | 0                 | x3                     |
|             | a1      | a7           | a5+x2y3+x3y2      | a3+x0y3+x1y2+x2y1+x3y0 |
| 8           | 0       | 0            | 0                 | 0                      |
|             | a2      | a0           | a6+x3y3           | a4+x1y3+x2y2+x3y1      |
| 9           | x0      | 0            | 0                 | 0                      |
|             | a3+x0y3 | a1           | a7                | a5+x2y3+x3y2           |

Table 2: Multiply, accumulate operation
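The timing in Tables 1 and 2 can be checked with a small cycle-level model. The sketch below is an illustration rather than the authors' simulator: the function name `serial_parallel_mac`, the two flip-flops per accumulator bit-cell, and the carry held locally in each bit-cell are assumptions inferred from the tables and the surrounding text. Cycle numbers follow the tables, so the addend starts entering before cycle 1.

```python
def serial_parallel_mac(x, y, n, addend=0):
    """Cycle-level model of an n-bit serial-parallel multiply-accumulator.

    Returns the low 2n bits of addend + x * y, read serially from the
    first accumulator flip-flop of the rightmost (y0) bit-cell.
    """
    # Column j (0 = leftmost) holds multiplicand bit y[n-1-j], as in the tables.
    y_bits = [(y >> (n - 1 - j)) & 1 for j in range(n)]
    x_reg = [0] * n   # multiplier pipeline, one flip-flop per column
    ff1 = [0] * n     # first accumulator flip-flop of each bit-cell
    ff2 = [0] * n     # second accumulator flip-flop (alignment delay)
    carry = [0] * n   # carry stored locally in each bit-cell
    result = 0

    for t in range(2 - n, 3 * n):              # clock cycle, numbered as in the tables
        x_in = (x >> (t - 1)) & 1 if 1 <= t <= n else 0
        w = t + n - 2                          # bit weight handled by column 0
        a_in = (addend >> w) & 1 if 0 <= w < 2 * n else 0

        x_reg = [x_in] + x_reg[:-1]            # multiplier shifts one column per cycle
        new_ff1 = []
        for j in range(n):
            s_in = a_in if j == 0 else ff2[j - 1]
            total = (x_reg[j] & y_bits[j]) + s_in + carry[j]
            new_ff1.append(total & 1)          # sum bit to the first flip-flop
            carry[j] = total >> 1              # carry reused by the same cell next cycle
        ff2 = ff1                              # sums advance one column every two cycles
        ff1 = new_ff1

        if t >= n:                             # product bit t-n leaves the y0 cell
            result |= ff1[n - 1] << (t - n)
    return result
```

For n = 4 the model reads the least significant product bit at cycle 4 and the last bit at cycle 11, matching the rightmost column of Table 1. Checking it against ordinary integer arithmetic for all 4-bit operand pairs is a quick way to confirm that the assumed pipeline alignment is self-consistent.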

Each bit-cell of the multiplier module may be constructed from four basic cell types: XOR, Carry, Latch, and Product, which will be described later. In the CMOS version, these were implemented in differential dynamic circuits that, like the array architecture, were very constrained in propagation path, fanout, and connectivity. Mapping the basic differential circuit structure to dynamic GaAs circuits was straightforward.
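As a rough logical picture of how the four cell types could combine into one bit-cell, the sketch below models each signal as a differential (true, complement) pair. The composition shown, the function names, and the treatment of the latch as a simple delay are assumptions for illustration; the actual cell interconnection is defined by the circuits in Figures 5 and 6.

```python
def rails(bit):
    """Encode a bit as a differential (true, complement) pair."""
    return (bit, 1 - bit)


def product_cell(x, y):
    """Partial product x AND y, with its complement generated separately."""
    (xt, xf), (yt, yf) = x, y
    return (xt & yt, xf | yf)


def xor_cell(a, b):
    """Differential XOR: the complement rail carries XNOR."""
    (at, af), (bt, bf) = a, b
    return ((at & bf) | (af & bt), (at & bt) | (af & bf))


def carry_cell(a, b, c):
    """Majority function; the complement is the majority of the complements."""
    (at, af), (bt, bf), (ct, cf) = a, b, c
    return ((at & bt) | (at & ct) | (bt & ct),
            (af & bf) | (af & cf) | (bf & cf))


def latch_cell(d):
    """Pipeline delay: the pair passes unchanged to the next clock phase."""
    return d


def bit_cell(x, y, a, c):
    """One multiplier bit-cell: add the partial product to sum-in and carry-in."""
    pp = product_cell(x, y)
    s = xor_cell(xor_cell(pp, a), c)
    c_next = carry_cell(pp, a, c)
    return latch_cell(s), latch_cell(c_next)
```

Every cell produces both rails from both rails of its inputs, so the outputs remain complementary at each stage.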

## 3 GaAs Systolic Cells

The first step in this project was to design the circuits for each cell required. The CMOS circuits were available and their performance characteristics well documented [9]. The CMOS design required 5 cells: a latch, a product cell, a carry cell, and both a P-logic and an N-logic XOR cell. These cells could then be tiled into arrays with alternating P and N stages. The GaAs cells are composed of only depletion-mode devices, so only 4 cells were necessary. Pipelining in the GaAs arrays is accomplished through the two clocks, whereas the CMOS design relies on the alternating P and N stages with a single clock signal. Figures 5 and 6 show the CMOS circuits and the corresponding GaAs analogs, respectively.



Figure 5: CMOS logic cells for systolic arrays

A quick comparison of the CMOS and GaAs circuits shows that the device counts (including capacitors) are comparable. The CMOS process offered by MOSIS is a $2\,\mu m$ (minimum feature size), 2-metal-layer process. MOSIS offers a $0.7\,\mu m$, 3-metal-layer GaAs process (Vitesse). The smaller feature size, along with the additional metal layer, should allow more densely packed circuits and hence reduced parasitic capacitances. Smaller parasitics, coupled with gallium arsenide's higher electron mobility, mean that the cells should run at a significantly faster clock speed than the CMOS circuits.



Figure 6: GaAs logic cells for systolic arrays

## 4 Simulation

The GaAs circuits were simulated using P-SPICE (professional version). Since a circuit layout was not implemented, the parasitic capacitances were conservatively estimated. All of the GaAs circuits were simulated at 1 and 2 GHz, and all of the cells showed reliable operation at 2 GHz.
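To put the clock rates in terms of arithmetic throughput, the following sketch combines the simulated clock frequencies with the 2n clock cycles per multiplication stated for the serial-parallel multiplier. The 16-bit word width and the function names are assumptions chosen for illustration; the 100 MHz figure is the lower bound quoted for the CMOS cells.

```python
def multiply_time_ns(n_bits, clock_hz):
    """Time for one n-bit multiplication at 2n clock cycles per product."""
    cycles = 2 * n_bits
    return cycles / clock_hz * 1e9


def products_per_second(n_bits, clock_hz):
    """Sustained multiplications per second for a single multiplier module."""
    return clock_hz / (2 * n_bits)


if __name__ == "__main__":
    word_width = 16  # assumed word width, not taken from the paper
    for label, clock_hz in (("CMOS, 100 MHz", 100e6), ("GaAs, 2 GHz", 2e9)):
        print(label,
              multiply_time_ns(word_width, clock_hz), "ns per product,",
              products_per_second(word_width, clock_hz), "products/s")
```

Because the arrays are bit-extensible, the clock rate does not depend on word width, so the time per product grows only linearly with n.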

## 5 Conclusions

The speed of gallium arsenide technology, applied to differential dynamic circuits in bit-systolic arrays, would enable very high performance solutions to a broad range of problems. The basic cell architectures themselves have been tested and their performance verified in CMOS [9]. The GaAs cells have the same functionality as the CMOS cells; however, since no layout of the circuits has been done, the performance evaluation is preliminary and conservative. It is estimated that, given a good layout, these cells would perform at least twice as fast as the models used for this paper.

Figure: P-SPICE simulation of the GaAs carry cell at 2 GHz, plotting v(clk1) and v(cout) against time from 0 to 4.0 ns (vertical axis -4.0 V to 4.0 V; temperature 27.0; run dated 05/20/92).

## References

- [1] A. G. Eldin, "New Dynamic FET Logic and Serial Memory Circuits for VLSI GaAs Technology," 3rd NASA SERC Symposium on VLSI Design, Moscow, Idaho, 1991.
- [2] D. K. Ferry, Gallium Arsenide Technology, Howard Sams & Company, Indianapolis, Indiana, 1985.
- [3] N. Kanopoulos, Gallium Arsenide Digital Integrated Circuits: A Systems Perspective, Prentice Hall, Englewood Cliffs, New Jersey, 1989.
- [4] S. I. Long and S. E. Butner, Gallium Arsenide Digital Integrated Circuit Design, McGraw-Hill Publishing Company, New York, New York, 1990.
- [5] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley Publishing Company, Reading, Massachusetts, 1980.
- [6] N. Sclater, Gallium Arsenide IC Technology: Principles and Practices, TAB Professional and Reference Books, Blue Ridge Summit, Pennsylvania, 1988.
- [7] M. Shur, GaAs Devices and Circuits, Plenum Press, New York, New York, 1987.
- [8] P. W. Tuinenga, SPICE: A Guide to Circuit Simulation & Analysis Using P-Spice, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
- [9] K. Winters, D. Mathews, and T. Thompson, "Application Specific Serial Arithmetic Arrays," 2nd NASA SERC Symposium on VLSI Design, Moscow, Idaho, 1990.
- [10] A. G. Eldin and M. I. Elmasry, "VLSI Dynamic Memory," United States Patent #4,791,611, Dec. 13, 1988.
- [11] A. G. Eldin and M. I. Elmasry, "New Dynamic Logic and Memory Circuit Structures For BICMOS Technologies," IEEE Journal of Solid State Circuits, vol. SC-22, pp. 450-453, June 1987.
- [12] A. G. Eldin and M. I. Elmasry, "A Novel JCMOS Dynamic RAM Cell For VLSI Memories," IEEE Journal of Solid State Circuits, vol. SC-15, pp. 715-723, June 1985.
- [13] K. Winters, "Serial Multiplier Arrays for Parallel Computation," NASA SERC Symposium on VLSI Design, Moscow, Idaho, 1990.