Supercomputing on Massively Parallel Bit-Serial Architectures

Consider the idea that supercomputing is a synergy of generic algorithms, languages and architectures and that real breakthroughs in parallel computing will be achieved by considering all three together in a simulated software environment. Engineering tradeoffs could be made between performance, machine transparency, standardization and program portability before any new machines are actually built. Standardized languages could be developed for generic subclasses of parallel machines; languages that really give high performance and encourage free parallel expression and "thinking in parallel".

My own research on the Goodyear MPP (Massively Parallel Processor), suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined "train" broadcast, and conditional branching at the PE control unit level.

The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. The MPP timing estimate for Gaussian elimination of a 4K by 4K single bit matrix is under one second -- the equivalent of approximately 64 billion scalar operations. Parallel Gauss-Jordan matrix inversion is also being investigated. The estimated time to invert a 128 X 128, 32 bit real matrix using full pivoting on the MPP is 50 msec. This is roughly equivalent to a 100 MFLOP scalar rate.

The MPP is a SIMD machine of 16384 single bit processors arranged in a 128 X 128 array. Individual PE's are interconnected with their four nearest neighbors. Each PE can address 1024 bits of its own local memory. A 32 bit shift register in each PE allows for micro-pipelining of long words and faster partial sum accumulation for multiplication. The machine can execute 160 billion micro-instructions per second which translates to 800 GOPS for some instructions. Operations include single bit logical, shift, and add as well as column I/O and one or two dimensional routing in a spiral, cylinder, or torus. All operations can be directly or indirectly masked. The logical "or" of one bit per PE (SUMOR) can be used to pass array information back to the PE control unit for broadcast to other PE's, scalar I/O or conditional branching. If a second MPP were ever built, it might look considerably different than the current MPP. For example, it would certainly have greater memory depth -- at least 64K bits per PE. It might also have a reconfigurable bit/byte serial ALU, staged PE's for table lookup arithmetic, and pipelined SUMOR logic.

Ken Iobst
4/15/85
SUPERCOMPUTING ON MASSIVELY PARALLEL BIT-SERIAL ARCHITECTURES

- SUPERCOMPUTING DOMAIN
- NEW DIMENSIONS IN PARALLEL COMPUTING
- SOME GENERIC ALGORITHMS
- THE GOODYEAR MPP
- SOME MPP SPECIFIC ALGORITHMS CODED IN A FORTRAN-LIKE BIT-SERIAL PROGRAMMING LANGUAGE
- WHAT MIGHT A SECOND GENERATION MPP LOOK LIKE?
NEW DIMENSIONS IN PARALLEL COMPUTING

- BIT-SERIAL
- DIVISION BY $2^n \pm 1$
- MIN/MAX SEARCH
- MATRIX MULTIPLICATION
- COLUMN BROADCAST
- GAUSSIAN ELIMINATION
- LINEAR RECURRENCE
- PIPELINE/SEQUENTIAL
- PARALLEL
PERFORMANCE WITH INTEGER OPERANDS

128 x 128 ARRAY
10 MHz CLOCK RATE

ADD ARRAY TO ARRAY
SUBTRACT ARRAY FROM ARRAY
MULTIPLY ARRAY BY ARRAY

MILLIONS OF OPERATIONS PER SECOND

OPERAND WORDLENGTHS, BITS

0 8 16 24 32 40 48 56 64

10 100 1,000 10,000
DIVISION BY $2^n \pm 1$ EXAMPLE

FROM THE BINOMIAL THEOREM,

$$\frac{1}{1 \pm x} = 1 \mp x + x^2 \mp x^3 + \cdots \quad (x^2 < 1)$$

BY A CHANGE OF VARIABLE $y = \frac{1}{x}$ THEN

$$\frac{1}{y^{\pm 1}} = \frac{1}{y} \mp \frac{1}{y^2} \pm \frac{1}{y^3} \mp \cdots \quad (y^2 > 1)$$

NOW LET $y = 2^n$ AND DIVISION BY $2^n \pm 1$ REDUCES TO A SHORT SEQUENCE OF BINARY SHIFTS AND ADDS (AND/OR SUBTRACTS),

$$\frac{V}{2^n \pm 1} = \frac{V}{2^n} + \frac{V}{2^{2n}} + \frac{V}{2^{3n}} + \cdots$$

FOR EXAMPLE, LET $V = 237658$ AND $N = 10$ THEN

$$\frac{V}{2^N - 1} = \frac{237658}{1023} = 232.315$$

AFTER 3 SHIFTS AND 2 ADDS
THE GOODYEAR MPP

- SIMD MACHINE OF 16384 SINGLE BIT PROCESSORS ARRANGED IN A 128 X 128 ARRAY
- NEAREST NEIGHBOR INTERCONNECTIVITY
- 1024 BITS OF MEMORY PER PE
- 32 BIT SHIFT REGISTER ALLOWS FOR MICRO-PIPELINING AND FASTER MULTIPLICATION
- EXECUTION SPEED OF 160 BILLION MICRO-INSTRUCTIONS PER SECOND WHICH TRANSLATES TO 800 GOPS FOR SOME INSTRUCTIONS
- OPERATIONS INCLUDE SINGLE BIT LOGICAL, SHIFT, AND ADD AS WELL AS COLUMN I/O AND ONE OR TWO DIMENSIONAL ROUTING IN A SPIRAL, CYLINDER, OR TORUS
- ALL OPERATIONS CAN BE DIRECTLY OR INDIRECTLY MASKED
- THE LOGICAL "OR" OF ONE BIT PER PE (SUMOR) CAN BE USED TO PASS ARRAY INFORMATION BACK TO THE PE CONTROL UNIT FOR BROADCAST, SCALAR I/O, OR CONDITIONAL BRANCHING
PARALLEL/ASSOCIATIVE ARRAY SELECTION

REAL S(8:24), A[64,256](8:24)

S = SUMOR(A[64,256])
MAXIMUM OF 32 BIT INTEGER ARRAY
(OF UNIQUE VALUES)

BIT MAX[ ]
INTEGER A[128,128](0:32)
MAX=1
DO 1 I=1,32
IF (SUMOR(A[MAX](I))) MAX=A[MAX](I)
1 CONTINUE

; Declare MAX as bit mask over all PE's
; Declare A as a 128 x 128 unsigned integer array
; Initialize MAX to 1 over all PE's
; Scan bits in A from most to least significant bits
; Replace MAX with a new subset of maximum values for each non zero bit plane of A

MAXIMUM OF 32 BIT INTEGER ARRAY
(GENERAL CASE)

BIT MAX[ ],T[ ](46),INDEX[ ](14)
INTEGER A[128,128](0:32)
COMMON /INIT/ INDEX

MAX=1
T=A.CON. INDEX
DO 1 I=1,46
IF (SUMOR(T[MAX](I))) MAX=T[MAX](I)
1 CONTINUE

; Same algorithm as before except A array is first concatenated with the PE address field to insure uniqueness of result
REAL A[8,16,128](8:32), B[8,16,128](8:32),
& C[8,16,128](8:32), T[8,16,128](8:32)

READ A[,,1], B[1,,]
T=A[,,1...]*B[1...,,]
C=T[,,+]
PRINT C[,,1]
COLUMN BROADCAST EXAMPLE

\[ A_{i,j} = \]

\[ A_{i,j} = A_{i,J} \]

**REAL** \[ A[128,128](8:32) \]
\[ A=A[,J...] \]

**OR**

**REAL** \[ A[128,128](8:32) \]
**BIT** \[ M[ ] \]
\[ M=[128,128;,,J] \]
\[ A=A[,NOT.M][,128 \rightarrow] \]
COLUMN BROADCAST EXAMPLE

PROBLEM: TO BROADCAST A COLUMN OF FLOATING POINT NUMBERS ACROSS THE MPP ARRAY

SOLUTION #1: WITH PE’S INTERCONNECTED IN AN E/W CYLINDER; LOAD, SHIFT AND STORE THE 32 BIT VALUES ACROSS THE ARRAY. THIS TAKES APPROXIMATELY 3 X 32 X 128 = 12288 CYCLES.

SOLUTION #2: WITH PE’S INTERCONNECTED IN AN E/W CYLINDER; "TRAIN" BROADCAST THE 32 BIT VALUES ACROSS THE ARRAY. THIS CAN BE VIEWED AS A MICRO-PIPELINING OPERATION AND TAKES ONLY 207 CYCLES. THE ALGORITHM IS AS FOLLOWS:

- GET "TRAIN" OF 1 STOP BIT + 32 BIT VALUES OUT ONTO THE E/W PE CHANNEL (≈ 33 CYCLES)

- CIRCULATE "TRAIN" ONCE AROUND (≈ 128 CYCLES). DURING THIS PROCESS INDIVIDUAL PE’S WILL STORE THE "TRAIN" IN THEIR SHIFT REGISTERS. SHIFTING STOPS WHEN THE STOP BIT ENTERS THE CONDITIONAL MASK REGISTER OF EACH PE.

- STORE ALL SHIFT REGISTERS (≈ 32 CYCLES).
**GAUSSIAN ELIMINATION EXAMPLE**

**SINGLE BIT MATRIX**

<table>
<thead>
<tr>
<th>1</th>
<th>2</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>40</td>
<td>11</td>
<td>41</td>
<td>12</td>
</tr>
</tbody>
</table>

1 of 1000 BIT PLANES

MPP ARRAY

128 x 128
GAUSSIAN ELIMINATION EXAMPLE

BIT A[4000,4](1000), M[4000,4], USED(4000)
INTEGER PIVOT(4000,0:14), J1(0:2), J2(0:12), J(0:14)
EQUIVALENCE (J1,J(1)), (J2,J(3))

READ A
DO 1 I=1,4000
USED(I)=0
1 CONTINUE
DO 7 I=1,4000
DO 2 J2=1,1000
IF (SUMOR(A[I,J2])) GO TO 3
2 CONTINUE
GO TO 8
3 CONTINUE
DO 4 J1=1,4
IF (SUMOR(A[I,J1](J2))) GO TO 5
4 CONTINUE
5 CONTINUE
PIVOT(I)=J
USED(J)=1
M=A[I,J2].AND..NOT.[4000,4;I,J1] ; SAVE PIVOT COLUMN IN NEW MATRIX M, ZEROING THE PIVOT ROW VALUE
DO 6 J2=1,1000
6 CONTINUE
7 CONTINUE
8 CONTINUE
GAUSS-JORDAN MATRIX INVERSION

WITH FULL PIVOTING
PARALLEL DATA STRUCTURES

REAL ARRAYS

\[ U = [ A : I ] \quad \text{AUGMENTED MATRIX} \]

\[ V = [ : ] \quad \text{WORKING ARRAY} \]

\[ W = I : ] \quad \text{WORKING ARRAY} \]

BIT MASKS

\[ X = [ \overline{1} : \overline{0} ] \quad \text{PIVOTED ROW/COLUMNS} \]

\[ Y = [ \overline{1} : \overline{1} ] \quad \text{PIVOT ROW} \]

WHERE \( I \) IS THE IDENTITY MATRIX

\( \overline{I} \) IS THE UNITY MATRIX

\( \overline{0} \) IS THE ZERO MATRIX
OTHER DATA STRUCTURES

SCALARS

DET = 1     PIVOT
PARALLEL APPROACH TO MATRIX INVERSION

REPEAT FOLLOWING STEPS N TIMES

- Find next pivot
- Update determinate (optional)
- Zero pivot row and column in X
- Zero pivot row in Y
- Normalize pivot row in U
- Broadcast pivot row N times into V
- Broadcast pivot column 2N times into W
- Perform parallel row operations for a single pivot
- Reset pivot row in Y

Then reorder rows in U to form

\[ U = [ I : A^{-1} ] \]
PARALLEL MATRIX INVERSION ALGORITHM

FOR I = 1 TO N
  PIVOT = MAX|U| PER X
  DET = DET * PIVOT

  X\[ \begin{bmatrix} \\ \\ \hline 0 \\ \hline \end{bmatrix} = 0 \\

  Y\[ \begin{bmatrix} \\ \\ \hline 0 \\ \hline \end{bmatrix} = 0 \\

  U\[ \begin{bmatrix} \\ \\ \hline \end{bmatrix} = U\[ \begin{bmatrix} \\ \\ \hline \end{bmatrix} / PIVOT \\

  V\[ \begin{bmatrix} \\ \\ \hline \end{bmatrix} = U\[ \begin{bmatrix} \\ \\ \hline \end{bmatrix} \\

  W\[ \begin{bmatrix} 1 \ldots 1 \end{bmatrix} = U\[ \begin{bmatrix} \vdots \end{bmatrix} \\

  U = U - V * W PER Y \\

  Y\[ \begin{bmatrix} \vline \hline \end{bmatrix} = 1 \\

END I

FOR J = 1 TO N
  FOR I = 1 TO N
    IF U[I,J] = 1 THEN V[J,*] = U[I,*]
  END I
END J

U = V
MPP II:
WHAT MIGHT IT LOOK LIKE?

- MUCH GREATER MEMORY DEPTH: AT LEAST 64K BITS PER PE, WITH AT LEAST ONE LEVEL OF INDIRECT ADDRESSING.

- RECONFIGURABLE BIT/NIBBLE/BYTE SERIAL ALU

- STAGED PE'S FOR TABLE LOOKUP ARITHMETIC.
  HOW MANY TABLES? WHAT SIZE? RAM OR ROM?

- PIPELINED SUMOR LOGIC