Vector processors

[Hennessy & Patterson, §B.1] Pipelines can be used to process vectors as well as scalar quantities.

• In the vector case, a single instruction operates on more than one set of operands.

• Multiple pipelines may be provided to allow different operations to take place on different streams of data.

The cost/performance ratio of vector processors is impressive, but only for problems that fit their mold.

Vector machines address a problem of large scientific programs: their active data sets are so large that data locality is poor, and cache performance is therefore quite bad.

In a vector machine, the compiler can often determine the access pattern and pipeline the accesses efficiently.

Several characteristics of vector machines make them capable of high performance:

• The computation of each result is independent of the computation of previous results, so there are no pipeline stalls due to hazards. How is this independence achieved?

• A single vector instruction can do the work of an entire loop.

• Vector instructions process operands that have a regular structure, and can thus be efficiently loaded into registers from interleaved memory.
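
To see these properties concretely, here is a minimal C sketch (mine, not from the notes) of the kind of loop that fits the vector mold: every iteration is independent, one vector instruction could replace the whole loop, and the operands are accessed with a regular, unit stride.

    #include <stddef.h>

    /* Each iteration writes c[i] using only a[i] and b[i], so no result
       depends on a previous one: the independence that keeps a vector
       pipeline free of stalls.  The unit-stride traversal is the regular
       access pattern that interleaved memory can service at one word
       per clock. */
    void vec_add(double *c, const double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];    /* one addv could do all of this */
    }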

Basic vector architecture

[H&P, §B.2] A vector machine usually has—

• an ordinary pipelined scalar unit, and

• a vector unit.

Most machines have integer, boolean, and floating-point instructions that operate on the data.

All vector machines built today are vector-register machines: all instructions except load and store take their operands from vector registers.

What would be the alternative to a vector-register machine?

Here is a diagram of a sample vector architecture. It is Hennessy and Patterson’s DLXV, and is loosely based on the Cray-1.
Its scalar portion is Hennessy and Patterson’s DLX (see ECE 521).

The architecture has these components:

• Vector registers: 8 registers, each of which holds 64 doublewords.

• Vector functional units: Each unit is fully pipelined and can start a new operation on each clock cycle. A control unit must detect conflicts for functional units and data accesses.

• Vector load/store unit: A vector memory unit that loads or stores a vector to or from memory. Moves one word/cycle, after a startup latency.

• Scalar registers: 32 general-purpose and 32 floating-point.
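
As a rough illustration (the struct and its field names are my own invention, not part of any DLXV definition), the architectural state just listed can be summarized in C:

    #include <stdint.h>

    #define NUM_VREGS 8
    #define VREG_LEN  64   /* doublewords per vector register */

    /* Hypothetical model of DLXV architectural state, for illustration. */
    struct dlxv_state {
        double   vreg[NUM_VREGS][VREG_LEN]; /* 8 vector regs x 64 doublewords */
        int64_t  gpr[32];                   /* 32 general-purpose registers   */
        double   fpr[32];                   /* 32 floating-point registers    */
        uint32_t vlr;                       /* vector-length register         */
        uint64_t vm;                        /* vector-mask reg., 1 bit/element */
    };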

Below is a table, taken from Hennessy and Patterson (Fig. B.2), giving the characteristics of several vector-register machines.

All vector operations in DLXV are double-precision floating point.

Instruction set

There are five kinds of vector instructions, depending on whether the operands and results are vectors or scalars:

Type / Example
f1: V → V / lv
f2: V → S / pop
f3: V × V → V / addv
f4: V × S → V / addsv
f5: V × V → S / s_v

Processor / Year announced / Clock rate (MHz) / Registers / Elements per register (64-bit elements) / Functional units / Load-store units
Cray-1 / 1976 / 80 / 8 / 64 / 6: add, multiply, reciprocal, integer add, logical, shift / 1
Cray X-MP, Cray Y-MP / 1983, 1988 / 120, 166 / 8 / 64 / 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity / 2 load, 1 store
Cray-2 / 1985 / 166 / 8 / 64 / 5: FP add, FP multiply, FP reciprocal/sqrt, integer (add, shift, population count), logical / 1
Fujitsu VP100/200 / 1982 / 133 / 8–256 / 32–1024 / 3: FP or integer add/logical, multiply, divide / 2
Hitachi S810/820 / 1985 / 71 / 32 / 256 / 4: 2 integer add/logical, 1 multiply-add, 1 multiply/divide-add / 4
Convex C-1 / 1985 / 10 / 8 / 128 / 4: multiply, add, divide, integer/logical / 1
NEC SX/2 / 1984 / 160 / 8 + 8192 / 256 variable / 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift / 8
DLXV / 1990 / 200 / 8 / 64 / 5: multiply, divide, add, integer add, logical / 1
Cray C-90 / 1991 / 240 / 8 / 128 / 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity / 4
Convex C-4 / 1994 / 135 / 16 / 128 / 3: each is full integer, logical, and FP (including multiply-add)
NEC SX/4 / 1995 / 400 / 8 + 8192 / 256 variable / 16: 4 integer add/logical, 4 FP multiply/divide, 4 FP add, 4 shift / 8
Cray J-90 / 1995 / 100 / 8 / 64 / 4: FP add, FP multiply, FP reciprocal, integer/logical
Cray T-90 / 1996 / ≈500 / 8 / 128 / 8: FP add, FP multiply, FP reciprocal, integer add, 2 logical, shift, population count/parity / 4

Two special-purpose registers are also needed:

vlr the vector-length register, and

vm the vector-mask register.
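
A hedged sketch of how these registers govern a vector operation: vlr limits how many elements are processed, and vm enables or disables individual elements. The function below illustrates the semantics of a masked, length-limited addv; it is an illustration only, not how the hardware is built.

    #include <stdint.h>

    #define VREG_LEN 64

    /* Only the first vlr elements are processed, and among those, only
       the elements whose bit in the vector mask vm is 1. */
    void addv_masked(double v1[VREG_LEN],
                     const double v2[VREG_LEN], const double v3[VREG_LEN],
                     uint32_t vlr, uint64_t vm)
    {
        if (vlr > VREG_LEN)
            vlr = VREG_LEN;
        for (uint32_t i = 0; i < vlr; i++)
            if (vm & (1ULL << i))
                v1[i] = v2[i] + v3[i];
    }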

Here are the vector instructions in DLXV:

Opcode / Operands / Function
addv / v1, v2, v3 / Add elements of v2 and v3, then put each result in v1.
addsv / v1, v2, f0 / Add f0 to each element of v2, then put each result in v1.
subv / v1, v2, v3 / Subtract elements of v3 from v2, then put each result in v1.
subvs / v1, v2, f0 / Subtract f0 from elements of v2, then put each result in v1.
subsv / v1, f0, v2 / Subtract elements of v2 from f0, then put each result in v1.
multv / v1, v2, v3 / Multiply elements of v2 and v3, then put each result in v1.
multsv / v1, f0, v2 / Multiply f0 by each element of v2, then put each result in v1.
divv / v1, v2, v3 / Divide elements of v2 by v3, then put each result in v1.
divvs / v1, v2, f0 / Divide elements of v2 by f0, then put each result in v1.
divsv / v1, f0, v2 / Divide f0 by elements of v2, then put each result in v1.
lv / v1, r1 / Load vector register v1 from memory starting at address r1.
sv / r1, v1 / Store vector register v1 into memory starting at address r1.
lvws / v1, (r1, r2) / Load v1 from address in r1 with stride in r2; i.e., element i comes from r1 + i*r2.
svws / (r1, r2), v1 / Store v1 at address in r1 with stride in r2; i.e., element i goes to r1 + i*r2.
lvi / v1, (r1+v2) / Load v1 with the vector whose elements are at r1 + v2(i); i.e., v2 is an index vector.
svi / (r1+v2), v1 / Store v1 to the vector whose elements are at r1 + v2(i).
cvi / v1, r1 / Create an index vector by storing the values 0, 1*r1, 2*r1, …, 63*r1 into v1.
s_ _v / v1, v2 / Compare the elements (=, ≠, >, <, ≥, ≤) of v1 and v2. If the condition is true, put a 1 in the corresponding bit; otherwise put a 0. Put the resulting bit vector in the vector-mask register (vm).
s_ _sv / f0, v1 / Same compare, but using a scalar value as one operand.
pop / r1, vm / Count the 1s in vm and store count in r1.
cvm / (none) / Set the vector-mask register to all 1s.
movi2s / vlr, r1 / Move contents of r1 to the vector-length register.
movs2i / r1, vlr / Move the contents of the vector-length register to r1.
movf2s / vm, f0 / Move contents of f0 to the vector-mask register.
movs2f / f0, vm / Move the contents of the vector-mask register to f0.
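
The strided (lvws/svws) and indexed (lvi/svi) forms are easiest to see in scalar code. Here is a C sketch of their effective-address arithmetic (the function names are mine): element i comes from r1 + i*r2 for a strided load, and from r1 + v2(i) for an indexed load (a gather).

    #include <stddef.h>
    #include <stdint.h>

    #define VREG_LEN 64

    /* lvws v1,(r1,r2): element i is loaded from byte address r1 + i*r2. */
    void load_strided(double v1[VREG_LEN], const char *r1, ptrdiff_t r2)
    {
        for (size_t i = 0; i < VREG_LEN; i++)
            v1[i] = *(const double *)(r1 + i * r2);
    }

    /* lvi v1,(r1+v2): element i is loaded from byte address r1 + v2[i];
       v2 holds byte offsets (an index vector such as cvi builds). */
    void load_indexed(double v1[VREG_LEN], const char *r1,
                      const int64_t v2[VREG_LEN])
    {
        for (size_t i = 0; i < VREG_LEN; i++)
            v1[i] = *(const double *)(r1 + v2[i]);
    }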

Let us take a look at a sample program. We want to calculate

Y := a*X + Y

X and Y are vectors, initially in memory, and a is a scalar. This is the daxpy (double-precision a times X plus Y) that forms the inner loop of the Linpack benchmark’s Gaussian elimination routine.
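
For reference, daxpy is a one-line loop in a high-level language; here it is in C, for comparison with the assembly versions below.

    #include <stddef.h>

    /* y[i] = a*x[i] + y[i] for i = 0..n-1: the Linpack inner loop. */
    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }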

For now, we will assume that the length of the vector is 64, the same as the length of a vector register.

We assume that the starting addresses of X and Y are in rx and ry, respectively.

Here is the code for DLX (non-vector architecture).

1. ld f0, a

2. addi r4,rx,#512 Last address to load.

loop:

3. ld f2,0(rx) Load X[i]

4. multd f2,f0,f2 a*X[i]

5. ld f4,0(ry) Load Y[i]

6. addd f4,f2,f4 a*X[i] + Y[i]

7. sd f4,0(ry) Store into Y[i]

8. addi rx,rx,#8 Increment index for X

9. addi ry,ry,#8 Increment index for Y

10. sub r20,r4,rx Compute bound

11. bnz r20,loop Check if done.

Here is the DLXV code.

1. ld f0,a Load scalar a

2. lv v1,rx Load vector X

3. multsv v2,f0,v1 Vector-scalar multiply

4. lv v3,ry Load vector Y

5. addv v4,v2,v3 Add

6. sv ry,v4 Store the result

In going from the scalar to the vector code, the number of instructions executed is reduced from almost 600 to 6.

There are two reasons for this.

• Each vector instruction does the work of an entire 64-iteration loop, so far fewer instructions need to be fetched and issued.

• The loop overhead disappears: the index increments, bound test, and branch of the scalar loop have no counterpart in the vector code.
What can you say about pipeline interlocks? How does the change from scalar to vector code affect them?
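
We have assumed the vector length is exactly 64. For arbitrary n, the standard technique is strip mining: process the vector in chunks of at most 64 elements, setting the vector-length register for each chunk. A hedged C sketch:

    #include <stddef.h>

    #define MVL 64   /* maximum vector length (one vector register) */

    /* Each outer iteration handles one strip of up to MVL elements; on
       DLXV the inner loop would be the vector sequence above, executed
       with vlr set to `len`. */
    void daxpy_stripmined(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i += MVL) {
            size_t len = (n - i < MVL) ? n - i : MVL;  /* vlr for this strip */
            for (size_t j = 0; j < len; j++)
                y[i + j] = a * x[i + j] + y[i + j];
        }
    }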

Vector execution time

The execution time of a sequence of vector operations depends on—

• length of vectors being operated on,

• structural hazards (conflicts for resources) between operations, and

• data dependencies.

If two instructions contain no data dependencies or structural hazards, they could potentially begin execution on the same clock cycle.

(Most architectures can only initiate one operation per clock cycle, but for reasonably long vectors, the execution time of a sequence depends much more on the length of the vectors than on whether more than one operation can be initiated per cycle.)

Hennessy and Patterson introduce the term convoy to refer to the set of vector instructions that could potentially begin execution in a single clock period.

Counting convoys gives an approximation to the time required for a sequence of vector operations.

A chime is the time it takes to execute one convoy (the time is called 1 chime, regardless of vector length).

Thus,

• a vector sequence that consists of m convoys executes in m chimes;

• if the vector length is n, this is approximately m × n clock cycles.

Measuring in terms of chimes explicitly ignores processor-specific overheads.

Because most architectures can initiate only one vector operation per clock cycle, the chime count underestimates the actual execution time.

Example: Consider this code:

lv v1, rx Load vector X.

multsv v2, f0, v1 Vector-scalar multiply

lv v3, ry Load vector Y

addv v4, v2, v3 Add

sv ry, v4 Store the result

How many chimes will this sequence take?

How many chimes per flop are there?

The chime approximation is reasonably accurate for long vectors. For 64-element vectors, four chimes ≈ 4 × 64 = 256 clock cycles. The overhead of issuing convoy 2 in two separate clock cycles is small.

Startup time

The startup time is the chief overhead ignored by the chime model. For load and store operations, it is usually longer than for other operations. For loads:

• up to 50 clock cycles on some machines.

• 9 to 17 clock cycles on the Cray-1 and Cray X-MP.

• 12 clock cycles on DLXV.

For stores, startup time makes less difference.

• Stores do not produce results, so it is not necessary to wait for them to finish.

• The exception is when, for example, a load has to wait for a store to finish because there is only one memory pipeline.

Here are the startup penalties for DLXV vector operations:

Operation / Startup penalty
Vector add / 6
Vector multiply / 7
Vector divide / 20
Vector load / 12

Assuming these times, how long will it really take to execute our four-convoy example, above?

Convoy / Start time / 1st-result time / Last-result time
1. lv / 0 / 12 / 11 + n
2. multsv lv / 12 + n / 12 + n + 12 / 24 + 2n
3. addv / 25 + 2n / 25 + 2n + 6 / 30 + 3n
4. sv / 31 + 3n / 31 + 3n + 12 / 42 + 4n

(Convoy 2 finishes one cycle later than its startup alone would suggest because its two instructions must issue in separate clock cycles; cf. the remark above about issuing convoy 2 in two clocks.)

Thus, assuming that no other instruction can overlap the final sv, we have an execution time of 42 + 4n.

For a vector of length 64, this evaluates to 42 + 4 × 64 = 298 clock cycles.

The number of clock cycles per result is therefore 298/64 ≈ 4.66.

The chime approximation is 4 clock cycles per result. Thus the execution time with startup is about 1.16 times higher.
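
This arithmetic is easy to check mechanically; here is a small C program using the 42-cycle total startup derived above:

    #include <stdio.h>

    int main(void)
    {
        int n = 64;                          /* vector length */
        int total = 42 + 4 * n;              /* startup + 4 chimes, from above */
        double per_result = (double)total / n;

        printf("total cycles      : %d\n", total);        /* 298 */
        printf("cycles per result : %.2f\n", per_result); /* about 4.66 */
        printf("vs. 4-cycle chime : %.2f times higher\n",
               per_result / 4.0);                         /* about 1.16 */
        return 0;
    }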

Effect of the memory system

To maintain a rate of one word per cycle, the memory system must be able to produce/accept data this fast.

This is done by creating separate memory banks that can be accessed in parallel. This is similar to interleaving.

Memory banks can be implemented in two ways:

• Parallel access. Access all the banks in parallel, latch the results, then transfer them one by one. This is essentially S-access interleaving.

• Independent bank phasing. On the first access, access all banks in parallel, then transfer the words one at a time; each bank begins a new access as soon as its current one finishes. This is essentially C-access interleaving.

If an initiation rate of one word per clock cycle is to be maintained, it must be true that—

# of memory banks ≥ memory access time in clock cycles.

Example: We want to fetch—

• a vector of 64 elements

• starting at byte address 136,

• with each memory access taking 6 “clocks.”

Then,

• How many memory banks do we need?

• What addresses are used to access the banks?

• When will each element arrive at the CPU?

Since each access takes 6 clocks, we need at least 6 banks to sustain one reference per clock cycle; we round up to 8, a power of 2, to simplify bank selection. We will employ independent bank phasing.

Here is a table showing the memory addresses (in bytes) by

• bank number, and

• time at which access begins.

Beginning at cycle # / Bank 0 / Bank 1 / Bank 2 / Bank 3 / Bank 4 / Bank 5 / Bank 6 / Bank 7
0 / 192 / 136 / 144 / 152 / 160 / 168 / 176 / 184
6 / 256 / 200 / 208 / 216 / 224 / 232 / 240 / 248
14 / 320 / 264 / 272 / 280 / 288 / 296 / 304 / 312
22 / 384 / 328 / 336 / 344 / 352 / 360 / 368 / 376
… / … / … / … / … / … / … / … / …

Note that—

• Accesses begin 8 bytes apart, since each element is an 8-byte doubleword.

• The exact time at which a bank transmits its data is given by the element's position in the vector (one word is transferred per clock cycle) plus the memory latency.

Memory latency is 6 cycles.
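
A short C sketch (assuming 8-byte elements, low-order interleaving on the doubleword address, and the 6-cycle latency above) reproduces the bank assignments and steady-state arrival times from the table:

    #include <stdio.h>

    int main(void)
    {
        const long base   = 136; /* starting byte address                  */
        const int banks   = 8;   /* chosen so banks >= 6-cycle access time */
        const int latency = 6;   /* memory latency in clock cycles         */

        for (int i = 0; i < 16; i++) {        /* first 16 of the 64 elements */
            long addr  = base + 8L * i;       /* accesses are 8 bytes apart  */
            int bank   = (int)((addr / 8) % banks); /* low-order interleaving */
            int arrive = latency + i;   /* one word per clock after startup */
            printf("element %2d: address %3ld, bank %d, arrives at cycle %2d\n",
                   i, addr, bank, arrive);
        }
        return 0;
    }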