BSS 797: Principles of Parallel Computing
Lecture 15
Parallel Linear Algebra II
Algorithm 13: Matrix-Vector operations
The setup: Multiply an n*n matrix A with an n*1 vector B (to produce a vector C) on P processors. The matrix A is
a11 a12 ... a1n
a21 a22 ... a2n
......
an1 an2 ... ann
and the vector B
b1
b2
...
bn
One processor: assume enough memory to fit the matrix.
for (i = 1; i <= n; i++) {
    C[i] = 0;
    for (j = 1; j <= n; j++) C[i] += A[i][j] * B[j];
}
Complexity: T(1,n) = c n^2, where c is a constant characterizing the processor.
Matrix-Vector on P nodes
Method I: The most obvious approach: row-partition the matrix and replicate the n*1 vector on all P processors.
Matrix A is row partitioned as follows (aij@p denotes that entry aij is stored on processor p):
a11@1 a12@1 ... a1n@1
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
a41@2 a42@2 ... a4n@2
a51@3 a52@3 ... a5n@3
a61@3 a62@3 ... a6n@3
......
an1@P an2@P ... ann@P
while the vector is replicated on all processors (bi@1<->P denotes that bi is stored on every processor from 1 to P):
b1@1<->P
b2@1<->P
b3@1<->P
b4@1<->P
b5@1<->P
b6@1<->P
...
bn@1<->P
Therefore, each processor holds n/P rows of the matrix and the entire vector.
The implementation is straightforward; a minimal sketch follows.
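A minimal sketch of Method I in C with MPI, assuming double-precision data, row-major storage, and that P divides n; the names matvec_method1, Aloc, Cloc are illustrative, not from the notes:

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns n/P rows of A (Aloc) and a full copy of B.
   It computes its n/P entries of C locally, then an all-gather
   replicates the complete result C on every rank. */
void matvec_method1(const double *Aloc, const double *B, double *C, int n)
{
    int P;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int rows = n / P;                       /* rows owned by this rank */
    double *Cloc = malloc(rows * sizeof(double));
    for (int i = 0; i < rows; i++) {        /* local (n/P) x n matvec */
        Cloc[i] = 0.0;
        for (int j = 0; j < n; j++)
            Cloc[i] += Aloc[i * n + j] * B[j];
    }
    /* gather the P partial results so every rank holds the full C;
       this is the d * P * (n/P) communication term in the analysis */
    MPI_Allgather(Cloc, rows, MPI_DOUBLE, C, rows, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    free(Cloc);
}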
Performance analysis for this row partition:
Each processor multiplies an (n/P)*n submatrix with the n*1 vector; the cost is Tm(P,n) = c n * (n/P). The time to gather the P partial (n/P)*1 result vectors into a full n*1 vector on every processor is d * P * (n/P) = d n, where d is a machine communication constant. The speedup is
S(P,n) = T(1,n)/T(P,n) = c n^2 / (c n^2/P + d n) = P/(1 + c1 P/n), where c1 = d/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. The method is easy to implement
3. Two defects:
1. applications that use this mapping are rare
2. memory redundancy: the vector is duplicated on every processor
Method II: Row partition both the matrix and vector.
a11@1 a12@1 ... a1n@1
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
a41@2 a42@2 ... a4n@2
a51@3 a52@3 ... a5n@3
a61@3 a62@3 ... a6n@3
......
an1@P an2@P ... ann@P
The vector is
b1@1
b2@1
b3@2
b4@2
b5@3
b6@3
...
bn@P
Each processor is given n/P rows of the matrix and n/P rows of the vector.
Implementation Steps
Step I: Multiply the diagonal blocks shown below with the local subvectors, locally and in parallel (N marks entries not used in this step).
a11@1 N N N
N a22@2 N N
N N a33@3 N
N N ... N
N N N ann@P
The vector is
b1@1
b2@2
b3@3
...
bn@P
Step II: Roll the (n/P)-element subvectors up by one processor (the first wraps around to the bottom) and multiply the newly aligned blocks:
N a12@1 N N
N N a23@2 N
N N N a34@3
N N N ...
an1@P N N N
The vector is
b2@1
b3@2
b4@3
...
b1@P
Step III: Repeat Step II until all vector elements have visited all processors (P multiply steps in total, counting Step I). Done! A sketch of the roll-based implementation follows.
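A minimal sketch of Method II in C with MPI, under the same assumptions as before (P divides n; matvec_method2, Aloc, bloc, Cloc are illustrative names). The roll is expressed with MPI_Sendrecv_replace; the final roll simply restores the subvectors to their original owners:

#include <mpi.h>

/* Each rank owns n/P rows of A (Aloc, (n/P) x n, row-major), the matching
   n/P entries of B (bloc, overwritten by the rolls), and produces its
   n/P entries of C (Cloc). */
void matvec_method2(const double *Aloc, double *bloc, double *Cloc, int n)
{
    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int m = n / P;                           /* block size, assume P | n */
    int up   = (rank - 1 + P) % P;           /* where our subvector rolls to  */
    int down = (rank + 1) % P;               /* where the next one comes from */
    int owner = rank;                        /* index of the block we hold */
    for (int i = 0; i < m; i++) Cloc[i] = 0.0;
    for (int step = 0; step < P; step++) {
        /* multiply the columns of A aligned with the current subvector */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                Cloc[i] += Aloc[i * n + owner * m + j] * bloc[j];
        /* roll the subvector up by one processor (first wraps to last) */
        MPI_Sendrecv_replace(bloc, m, MPI_DOUBLE, up, 0, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        owner = (owner + 1) % P;
    }
}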
Performance analysis
Each processor multiplies an (n/P)*n submatrix with an n*1 vector (accumulated over the P steps); the cost is Tm(P,n) = c n * (n/P). The communication time to roll the subvectors is proportional to n: each of the P rolls moves n/P elements. Adding the final collection time, the communication cost is d' n (d' is a machine constant, larger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1' P/n), where c1' = d'/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. memory is parallelized (no redundant copy of the vector)
3. this mapping is popular in applications
4. One defect: harder to implement because of the roll communication
Method III: Column partition both the matrix and vector.
a11@1 a12@2 ... a1n@P
a21@1 a22@2 ... a2n@P
a31@1 a32@2 ... a3n@P
a41@1 a42@2 ... a4n@P
a51@1 a52@2 ... a5n@P
a61@1 a62@2 ... a6n@P
......
an1@1 an2@2 ... ann@P
Vector is
b1@1
b2@2
...
bn@P
Each processor is given n/P columns of the matrix and n/P rows of the vector.
Implementation steps
Step I: All P processors multiply their columns of sub-row block 1 of A with their local parts of B; a global sum (communication) combines the partial results into sub-row block 1 of C.
Step II: Likewise, multiply sub-row block 2 of A with B and collect a global sum to produce sub-row block 2 of C.
Step III: Repeat Step II until all sub-row blocks of A are processed. A sketch follows.
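A minimal sketch of Method III in C with MPI, under the same assumptions (P divides n; matvec_method3, Aloc, bloc are illustrative names). Each global sum is written here as an MPI_Reduce rooted at the rank that owns the corresponding block of C, so C ends up row-partitioned like B:

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns n/P columns of A (Aloc, n x (n/P), row-major) and the
   matching n/P entries of B (bloc); Cloc receives this rank's block of C. */
void matvec_method3(const double *Aloc, const double *bloc,
                    double *Cloc, int n)
{
    int P;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int m = n / P;                          /* block size, assume P | n */
    double *partial = malloc(m * sizeof(double));
    for (int k = 0; k < P; k++) {           /* one sub-row block at a time */
        for (int i = 0; i < m; i++) {       /* local partial sums for block k */
            partial[i] = 0.0;
            for (int j = 0; j < m; j++)
                partial[i] += Aloc[(k * m + i) * m + j] * bloc[j];
        }
        /* global sum: block k of C lands on rank k */
        MPI_Reduce(partial, Cloc, m, MPI_DOUBLE, MPI_SUM, k, MPI_COMM_WORLD);
    }
    free(partial);
}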
Performance Analysis
Each processor multiplies its n*(n/P) submatrix with its (n/P)*1 subvector; the cost is Tm(P,n) = c n * (n/P). The communication time for the global sums is proportional to n. Thus, adding the final collection time, the communication cost is d'' n (d'' is a machine constant, larger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1'' P/n), where c1'' = d''/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. memory is parallelized
3. reasonably easy to implement (the communications are standard global sums)
4. this mapping is seen often in applications
Algorithm 14: Matrix-Matrix operations
The setup: Multiply an n*n matrix A with another n*n matrix B to produce C. Matrix A:
a11 a12 ... a1n
a21 a22 ... a2n
......
an1 an2 ... ann
Matrix B:
b11 b12 ... b1n
b21 b22 ... b2n
......
bn1 bn2 ... bnn
One processor: assume enough memory to fit the matrices.
for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
        C[i][j] = 0;
        for (k = 1; k <= n; k++) C[i][j] += A[i][k] * B[k][j];
    }
Complexity: T(1,n) = c n^3
where c is a constant characterizing the processor.
Method I: Row-column partition
Name: also known as the Ring Method.
The setup: Multiply an n*n matrix A with another n*n matrix B to produce C.
Initially,
Matrix A
a11@1 a12@1 ... a1n@1
a21@2 a22@2 ... a2n@2
......
an1@P an2@P ... ann@P
Matrix B
b11@1 b12@2 ... b1n@P
b21@1 b22@2 ... b2n@P
b31@1 b32@2 ... b3n@P
b41@1 b42@2 ... b4n@P
b51@1 b52@2 ... b5n@P
b61@1 b62@2 ... b6n@P
......
bn1@1 bn2@2 ... bnn@P
Remarks:
1. Row-column and column-row are symmetrical
2. Intrinsic 1D partition
Implementation Steps
Step I: Processor 1 multiplies sub-row 1 of A with sub-column 1 of B to create c11; Processor 2 multiplies sub-row 2 of A with sub-column 2 of B to create c22; and so on, until Processor P creates cPP. The diagonal sub-blocks of C are created after Step I.
Step II: Roll all sub-rows up by one processor, and multiply to produce c21, c32, ..., and c1P by Processors 1, 2, ..., P, respectively.
Matrix A
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
......
a11@P a12@P ... a1n@P
Matrix B
b11@1 b12@2 ... b1n@P
b21@1 b22@2 ... b2n@P
b31@1 b32@2 ... b3n@P
b41@1 b42@2 ... b4n@P
b51@1 b52@2 ... b5n@P
b61@1 b62@2 ... b6n@P
......
bn1@1 bn2@2 ... bnn@P
Step III: Roll up again and again until all sub-rows have visited all processors (P multiply steps in total). Done! A sketch follows.
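A minimal sketch of the Ring Method in C with MPI, assuming P divides n, row-major storage, and illustrative names (matmul_ring, Arow, Bcol, Cblocks). Each rank keeps its sub-column of B in place and passes its sub-row of A around the ring:

#include <mpi.h>

/* Each rank owns one sub-row block of A (Arow, m x n, m = n/P, overwritten
   by the rolls) and one sub-column block of B (Bcol, n x m). Each step
   computes one m x m block of C, then rolls the A block up by one rank. */
void matmul_ring(double *Arow, const double *Bcol, double *Cblocks, int n)
{
    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int m = n / P;                           /* block size, assume P | n */
    int up   = (rank - 1 + P) % P;
    int down = (rank + 1) % P;
    int row = rank;                          /* index of A's sub-row we hold */
    for (int step = 0; step < P; step++) {
        double *Cb = Cblocks + row * m * m;  /* block c[row][rank] of C */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += Arow[i * n + k] * Bcol[k * m + j];
                Cb[i * m + j] = s;
            }
        /* roll the A sub-row up by one processor (wrap-around ring) */
        MPI_Sendrecv_replace(Arow, m * n, MPI_DOUBLE, up, 0, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        row = (row + 1) % P;
    }
}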
Performance analysis
One processor time:
T(1,n) = c n^3 tcomp
On P processors,
T(P,n) = Troll + Tcomp:
the roll-up time plus the submatrix multiplication time.
Troll = (P-1) * (n^2/P) * tcomm ~= n^2 tcomm
Tcomp ~= P^2 c [n/P]^3 tcomp = c (n^3/P) tcomp.
Therefore,
T(P,n) = Troll + Tcomp ~= n^2 tcomm + c (n^3/P) tcomp.
Speed up:
S(P,n) = T(1,n)/T(P,n) = P/(1 + [P/(c n)] * tcomm/tcomp)
Performance Analysis Remarks:
1. overhead h(P,n) ~= (P/n) * tcomm/tcomp
2. the universal overhead law is confirmed again
3. only sufficiently large problems (n >> P) benefit
4. method is easy to implement
5. quite often, the natural way to decompose the problem
6. memory efficient