Handout #2 Problem Set # 2

ELEG 652 Principles of Parallel Computer Architectures

Issued: Wednesday, September 21, 2005

Due: Wednesday, October5, 2005

Please begin your answer to every problem on a new sheet of paper. Be as concise and clear as you can. Make an effort to be legible. To avoid misplacement of the various components of your assignment, make sure that all the sheets are stapled together. You may discuss problems with your classmates, but all solutions must be written up independently.

One important application of high performance computing systems is solving large engineering problems efficiently. Some physical/engineering problems can be formulated as mathematical problems involving large number of algebraic equations, a million equations with million unknowns is not a rarity. The core computation in solving such systems of equations is matrix operations, either Matrix-Matrix Multiply or Matrix Vector Multiply depending on the problem and solution methods under study. Therefore, a challenge to computer architectures is how to handle these two matrix operations efficiently.

In this problem, you will study the implementation of the matrix vector multiply (or MVM for short). There are lots of challenging issues associated with the implementation of MVM. First of all, matrices in many engineering problems are sparse matrices – where the number of non-zeroes entries in a matrix is much less than the total size of the matrix. Such sparse matrices pose an interesting problem in terms of efficient storage and computation. The use of different compressed formats for such matrices have been proposed and use in practice.

There are several ways of compressing sparse matrices to reduce the storage space. However, identifying the non-zero elements with the right indices is not a trivial task. The following questions deals with three types of the sparse matrix storage.

Compressed Row Storage (CRS) is amethod that deals with the rows of a sparse matrix. It consists of 3 vectors. The first vector is the “element value vector”: a vector which contains all the nonzero elements of the array (collected by rows). The second vector has the column indices of the non-zero values. This vector is called “column index vector”. The third vector consists of the row pointers to each of the elements in the first vector. That is, elements of this “row pointer vector” consists of the element number (for example. xth element start from row k) with which a new row starts.

With the above format in mind try to compress the following sparse matrix and represent it in a CRS format:

An example of CRS is presented below

This is the result if the initial index is zero. If the initial index is one then every entry in vector 2 and 3 will be plus one, except for the last entry of Vector 3, which will be 9 in both cases. Both results (initial index 0 or 1) are acceptable, but please be consistent.

In this problem, you are going to experiment with an implementation of Matrix-Vector Multiply (MVM) using different storage formats. Matrix-Vector Multiply has the following pseudo-code:

where C and B are vectors of size N and A is a sparse matrix of size NxN[1]. In this specific code, the order is Sparse Matrix times Vector. Create a C code that:

Create a random dense vector of N size (B).
Reads a sparse matrix of size NxN (A) and transform it to CRS format
Calculate the MVM (A*B) of the vector and the sparse matrix in CRS format. Make sure that your code has some type of timing function (i.e. getrusage, clock, gettimeofday, _ftime, etc) and time the MVM operation using the CRS matrix.
Repeat the operation with a Compress Column Storage[2] format matrix and compare performance numbers. Report your findings, as which method has better performance, and run it for at least five matrix/vector sizes. You can use the same methods that were used in Homework 1 to calculate performance. Please report your machine configuration as you did in Homework 1.

Have a look at the following storage representation, known as Jagged Diagonal Storage (JDS).

Explain how it is derived (Hint: Think in the sequence of E.S.S. which stands for Extract, Shift and Sort)
Why is this format claimed to be useful for the implementation of iterative methods on parallel and vector processors?
Can you imagine a situation in which this format will perform worse?
Modify your program to create (read) matrices in a JDS and repeat your experiments as in (2). Report your results for this storage method as you did for CRS and CCS.

It has been a privilege for a computer architect to work directly with real world problems and associates matrices. Here is an opportunity for you. The following Fortran code generates a matrix representing a real world engineering problem.

From ftp:// download the Fortran code (generate.f)
Read the code and identify the storage format of the matrix generated;
Create a C version of the matrix generator
Match this format with one of the formats used for the MVM code in problem 2. (Hint: The matrix is too big to be handed converted!)
Run your MVM code on this matrix
Record your observations, issues and your results working with real matrices

You have learned from class that the performance of a vector architecture can often be described by the timing of a single arithmetic operation of a vector of length n. This will fit closely to the following generic formula for the time of the operation, t, as a function of the vector length n:

t = r-1∞ (n + n1/2)

The two parameters r∞ and n1/2 describe the performance of an idealized computer under the architecture model/technology and give a first order description of a real computer. They are defined as:
The maximum asymptotic performance r∞ - the maximum rate of computation in floating point operation performed per second (in MFLOPS). For the generic computer, this occurs asymptotically for vectors of infinite length.
The half-performance length n1/2 – the vector length required to achieve half the maximum possible performance.

The benchmark used to measure (r∞, n1/2) is shown below:

Please replace the call Timing_function with a C code timing function (clock, getrusage, etc). Please identify and explain your choices.

Assume vector machine X is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 10 and vector machine Y is measured (by the benchmark above) such that its r∞ = 300 MFLOPS and n1/2 = 100. What these numbers mean? How can you judge the performance difference between X and Y? Explain.
Describe how would you use this benchmark (or its variation) to derive (r∞, n1/2)
Rewrite the code to C, port them onto a Sun Workstation and run it. Tell us which machine you used. Choose a sensible NMAX (Suggestion: NMAX = 256).
Report your results and make plots of:
Runtime V.S. Vector Length
Performance (MFLOPS) V.S. Vector Length
What are the values of the two parameters you obtained?
State on which machine model and type you performed your experiments.

Problem 3.1

Match each of the following computer systems: KSR-1, Dash, CM-5, Monsoon, Tera MTA, IBM SP-2, Beowulf, SUN UltraSparc-450with one of the best descriptions listed below. The mapping is a one-to-one correspondence in this case.

A massively parallel system built with multiple-context processor and a 3-D torus architecture. (Tera MTA)
Linux-based PCs with fast Ethernet. (Beowulf).
A ring-connected multiprocessor using cache-only memory architecture. (KSR-1)
An experimental multiprocessor built with a dataflow architecture.(Monsoon)
A research scalable multiprocessor built with distributed shared memory coherent caches. (DASH)
An MIMD distributed-memory computer built with a large multistage switching network.(CM-5)
A small scale shared memory multiprocessor with uniform address space.( SUN UltraSparc-450 )
A cache-coherent non-uniform memory access multiprocessor built with Fat Hypercube network.

You are encouraged to search conference proceedings, related journals, and the Internet extensively for answers. For each computer system, write one paragraph about the system characteristics such as the maximum node numbers.

Hint: You need to learn how to quickly find references from Proceedings of International Symposium of Computer Architecture (ISCA), IEEE Trans. of Computer, IEEE Micro, the Journal of Supercomputing, IEEE Parallel & Distributed Processing.

Problem 3.2

Please try to fit the machine: “Cray Red Storm” into one of the classifications above. If none of them fits, please provide a short architecture description.

[1]If you want to get more information about the MVM, please take a look at the Fortran Implementation of MVM at

[2]Compress Column Storage is the same as CRS, but it uses columns pointers instead of row pointers; the non-zero elements are gatheredby columns and not rows; and Vector 2 contains the row indices instead of the column indices