BSS 797: Principles of Parallel Computing
Lecture 15
Parallel Linear Algebra II
Algorithm 13: Matrix-Vector operations
The setup: Multiply an n*n matrix A with an n*1 vector B (to produce a vector C) on P processors. The matrix A is
a11 a12 ... a1n
a21 a22 ... a2n
......
an1 an2 ... ann
and the vector B
b1
b2
...
bn
One processor: assume enough memory to fit the matrix.
for (i = 1; i <= n; i++) {
    C[i] = 0;
    for (j = 1; j <= n; j++) C[i] += A[i][j] * B[j];
}
Complexity: T(1,n) = c n^2, where c is a constant characterizing the processor.
Matrix-Vector on P nodes
Method I: The most obvious approach: row-partition the matrix and replicate the n*1 vector on all P processors.
Matrix A is row partitioned as follows (aij@p denotes that entry aij is stored on processor p):
a11@1 a12@1 ... a1n@1
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
a41@2 a42@2 ... a4n@2
a51@3 a52@3 ... a5n@3
a61@3 a62@3 ... a6n@3
......
an1@P an2@P ... ann@P
while the vector is replicated on all processors (bi@1<->P denotes that bi is stored on every processor from 1 to P):
b1@1<->P
b2@1<->P
b3@1<->P
b4@1<->P
b5@1<->P
b6@1<->P
...
bn@1<->P
Therefore, each processor holds n/P rows of the matrix and the entire vector.
The implementation is straightforward; a minimal sketch follows.
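A minimal sketch of Method I in C with MPI, assuming double-precision data, row-major storage, and that P divides n; the names matvec_method1, Aloc, Cloc are illustrative, not from the notes:

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns n/P rows of A (Aloc) and a full copy of B.
   It computes its n/P entries of C locally, then an all-gather
   replicates the complete result C on every rank. */
void matvec_method1(const double *Aloc, const double *B, double *C, int n)
{
    int P;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int rows = n / P;                       /* rows owned by this rank */
    double *Cloc = malloc(rows * sizeof(double));
    for (int i = 0; i < rows; i++) {        /* local (n/P) x n matvec */
        Cloc[i] = 0.0;
        for (int j = 0; j < n; j++)
            Cloc[i] += Aloc[i * n + j] * B[j];
    }
    /* gather the P partial results so every rank holds the full C;
       this is the d * P * (n/P) communication term in the analysis */
    MPI_Allgather(Cloc, rows, MPI_DOUBLE, C, rows, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    free(Cloc);
}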
Performance analysis for this row partition:
Each processor multiplies an (n/P)*n submatrix with the n*1 vector; the cost is Tm(P,n) = c n * (n/P). The time to gather the P partial (n/P)*1 result vectors into a full n*1 vector on every processor is d * P * (n/P) = d n, where d is a machine communication constant. The speedup is
S(P,n) = T(1,n)/T(P,n) = c n^2 / (c n^2/P + d n) = P/(1 + c1 P/n), where c1 = d/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. The method is easy to implement
3. Two defects:
1. applications that use this mapping are rare
2. memory redundancy: the vector is duplicated on every processor
Method II: Row partition both the matrix and vector.
a11@1 a12@1 ... a1n@1
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
a41@2 a42@2 ... a4n@2
a51@3 a52@3 ... a5n@3
a61@3 a62@3 ... a6n@3
......
an1@P an2@P ... ann@P
The vector is
b1@1
b2@1
b3@2
b4@2
b5@3
b6@3
...
bn@P
Each processor is given n/P rows of the matrix and n/P rows of the vector.
Implementation Steps
Step I: Multiply the diagonal blocks shown below with the local subvectors, locally and in parallel (N marks entries not used in this step).
a11@1 N N N
N a22@2 N N
N N a33@3 N
N N ... N
N N N ann@P
The vector is
b1@1
b2@2
b3@3
...
bn@P
Step II: Roll the (n/P)-element subvectors up by one processor (the first wraps around to the bottom) and multiply the newly aligned blocks:
N a12@1 N N
N N a23@2 N
N N N a34@3
N N N ...
an1@P N N N
The vector is
b2@1
b3@2
b4@3
...
b1@P
Step III: Repeat Step II until all vector elements have visited all processors (P multiply steps in total, counting Step I). Done! A sketch of the roll-based implementation follows.
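A minimal sketch of Method II in C with MPI, under the same assumptions as before (P divides n; matvec_method2, Aloc, bloc, Cloc are illustrative names). The roll is expressed with MPI_Sendrecv_replace; the final roll simply restores the subvectors to their original owners:

#include <mpi.h>

/* Each rank owns n/P rows of A (Aloc, (n/P) x n, row-major), the matching
   n/P entries of B (bloc, overwritten by the rolls), and produces its
   n/P entries of C (Cloc). */
void matvec_method2(const double *Aloc, double *bloc, double *Cloc, int n)
{
    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int m = n / P;                           /* block size, assume P | n */
    int up   = (rank - 1 + P) % P;           /* where our subvector rolls to  */
    int down = (rank + 1) % P;               /* where the next one comes from */
    int owner = rank;                        /* index of the block we hold */
    for (int i = 0; i < m; i++) Cloc[i] = 0.0;
    for (int step = 0; step < P; step++) {
        /* multiply the columns of A aligned with the current subvector */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                Cloc[i] += Aloc[i * n + owner * m + j] * bloc[j];
        /* roll the subvector up by one processor (first wraps to last) */
        MPI_Sendrecv_replace(bloc, m, MPI_DOUBLE, up, 0, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        owner = (owner + 1) % P;
    }
}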
Performance analysis
Each processor multiplies an (n/P)*n submatrix with an n*1 vector (accumulated over the P steps); the cost is Tm(P,n) = c n * (n/P). The communication time to roll the subvectors is proportional to n: each of the P rolls moves n/P elements. Adding the final collection time, the communication cost is d' n (d' is a machine constant, larger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1' P/n), where c1' = d'/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. memory is parallelized (no redundant copy of the vector)
3. this mapping is popular in applications
4. One defect: harder to implement because of the roll communication
Method III: Column partition both the matrix and vector.
a11@1 a12@2 ... a1n@P
a21@1 a22@2 ... a2n@P
a31@1 a32@2 ... a3n@P
a41@1 a42@2 ... a4n@P
a51@1 a52@2 ... a5n@P
a61@1 a62@2 ... a6n@P
......
an1@1 an2@2 ... ann@P
Vector is
b1@1
b2@2
...
bn@P
Each processor is given n/P columns of the matrix and n/P rows of the vector.
Implementation steps
Step I: All P processors multiply their columns of sub-row block 1 of A with their local parts of B; a global sum (communication) combines the partial results into sub-row block 1 of C.
Step II: Likewise, multiply sub-row block 2 of A with B and collect a global sum to produce sub-row block 2 of C.
Step III: Repeat Step II until all sub-row blocks of A are processed. A sketch follows.
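A minimal sketch of Method III in C with MPI, under the same assumptions (P divides n; matvec_method3, Aloc, bloc are illustrative names). Each global sum is written here as an MPI_Reduce rooted at the rank that owns the corresponding block of C, so C ends up row-partitioned like B:

#include <mpi.h>
#include <stdlib.h>

/* Each rank owns n/P columns of A (Aloc, n x (n/P), row-major) and the
   matching n/P entries of B (bloc); Cloc receives this rank's block of C. */
void matvec_method3(const double *Aloc, const double *bloc,
                    double *Cloc, int n)
{
    int P;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int m = n / P;                          /* block size, assume P | n */
    double *partial = malloc(m * sizeof(double));
    for (int k = 0; k < P; k++) {           /* one sub-row block at a time */
        for (int i = 0; i < m; i++) {       /* local partial sums for block k */
            partial[i] = 0.0;
            for (int j = 0; j < m; j++)
                partial[i] += Aloc[(k * m + i) * m + j] * bloc[j];
        }
        /* global sum: block k of C lands on rank k */
        MPI_Reduce(partial, Cloc, m, MPI_DOUBLE, MPI_SUM, k, MPI_COMM_WORLD);
    }
    free(partial);
}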
Performance Analysis
Each processor multiplies its n*(n/P) submatrix with its (n/P)*1 subvector; the cost is Tm(P,n) = c n * (n/P). The communication time for the global sums is proportional to n. Thus, adding the final collection time, the communication cost is d'' n (d'' is a machine constant, larger than d).
Speedup:
S(P,n) = T(1,n)/T(P,n) = P/(1 + c1'' P/n), where c1'' = d''/c.
Remarks:
1. h(P,n) ~= P/n: one more example of the universal overhead law
2. memory is parallelized
3. reasonably easy to implement (the communications are standard global sums)
4. this mapping is seen often in applications
Algorithm 14: Matrix-Matrix operations
The setup: Multiply an n*n matrix A with another n*n matrix B to produce C. Matrix A:
a11 a12 ... a1n
a21 a22 ... a2n
......
an1 an2 ... ann
Matrix B:
b11 b12 ... b1n
b21 b22 ... b2n
......
bn1 bn2 ... bnn
One processor: assume enough memory to fit the matrices.
for (i = 1; i <= n; i++)
    for (j = 1; j <= n; j++) {
        C[i][j] = 0;
        for (k = 1; k <= n; k++) C[i][j] += A[i][k] * B[k][j];
    }
Complexity: T(1,n) = c n^3
where c is a constant characterizing the processor.
Method I: Row-column partition
Name: also known as the Ring Method.
The setup: Multiply an n*n matrix A with another n*n matrix B to produce C.
Initially,
Matrix A
a11@1 a12@1 ... a1n@1
a21@2 a22@2 ... a2n@2
......
an1@P an2@P ... ann@P
Matrix B
b11@1 b12@2 ... b1n@P
b21@1 b22@2 ... b2n@P
b31@1 b32@2 ... b3n@P
b41@1 b42@2 ... b4n@P
b51@1 b52@2 ... b5n@P
b61@1 b62@2 ... b6n@P
......
bn1@1 bn2@2 ... bnn@P
Remarks:
1. Row-column and column-row are symmetrical
2. Intrinsic 1D partition
Implementation Steps
Step I: Processor 1 multiplies sub-row 1 of A with sub-column 1 of B to create c11; Processor 2 multiplies sub-row 2 of A with sub-column 2 of B to create c22; and so on, until Processor P creates cPP. The diagonal sub-blocks of C are created after Step I.
Step II: Roll all sub-rows up by one processor, and multiply to produce c21, c32, ..., and c1P by Processors 1, 2, ..., P, respectively.
Matrix A
a21@1 a22@1 ... a2n@1
a31@2 a32@2 ... a3n@2
......
a11@P a12@P ... a1n@P
Matrix B
b11@1 b12@2 ... b1n@P
b21@1 b22@2 ... b2n@P
b31@1 b32@2 ... b3n@P
b41@1 b42@2 ... b4n@P
b51@1 b52@2 ... b5n@P
b61@1 b62@2 ... b6n@P
......
bn1@1 bn2@2 ... bnn@P
Step III: Roll up again and again until all sub-rows have visited all processors (P multiply steps in total). Done! A sketch follows.
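A minimal sketch of the Ring Method in C with MPI, assuming P divides n, row-major storage, and illustrative names (matmul_ring, Arow, Bcol, Cblocks). Each rank keeps its sub-column of B in place and passes its sub-row of A around the ring:

#include <mpi.h>

/* Each rank owns one sub-row block of A (Arow, m x n, m = n/P, overwritten
   by the rolls) and one sub-column block of B (Bcol, n x m). Each step
   computes one m x m block of C, then rolls the A block up by one rank. */
void matmul_ring(double *Arow, const double *Bcol, double *Cblocks, int n)
{
    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int m = n / P;                           /* block size, assume P | n */
    int up   = (rank - 1 + P) % P;
    int down = (rank + 1) % P;
    int row = rank;                          /* index of A's sub-row we hold */
    for (int step = 0; step < P; step++) {
        double *Cb = Cblocks + row * m * m;  /* block c[row][rank] of C */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += Arow[i * n + k] * Bcol[k * m + j];
                Cb[i * m + j] = s;
            }
        /* roll the A sub-row up by one processor (wrap-around ring) */
        MPI_Sendrecv_replace(Arow, m * n, MPI_DOUBLE, up, 0, down, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        row = (row + 1) % P;
    }
}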
Performance analysis
One processor time:
T(1,n) = c n^3 tcomp
On P processors,
T(P,n) = Troll + Tcomp:
the roll-up time plus the submatrix multiplication time.
Troll = (P-1) * (n^2/P) * tcomm ~= n^2 tcomm
Tcomp ~= P^2 c [n/P]^3 tcomp = c (n^3/P) tcomp.
Therefore,
T(P,n) = Troll + Tcomp ~= n^2 tcomm + c (n^3/P) tcomp.
Speed up:
S(P,n) = T(1,n)/T(P,n) = P/(1 + [P/(c n)] * tcomm/tcomp)
Performance Analysis Remarks:
1. overhead h(P,n) ~= (P/n) * tcomm/tcomp
2. the universal overhead law is confirmed again
3. only sufficiently large problems (n >> P) benefit
4. method is easy to implement
5. quite often, the natural way to decompose the problem
6. memory efficient