Automatic Performance Tuning of Computational Kernels

Web Page:

Participating UC Berkeley Faculty:

Katherine Yelick (CS – contact – )

James Demmel (CS/Math – )

The goal of automatic performance tuning is to automatically generate and tune the computational kernels that are the bottlenecks in important applications, such as scientific computing and signal processing. These kernels, such as matrix operations and FFTs, have traditionally been hand-tuned by skilled programmers for each new kernel, compiler, and architecture. Hand-tuning is slow and difficult, because the complexity of the compiler and architecture means that small code changes can have large and unpredictable effects on performance. Automatic tuning replaces hand-tuning and produces code that runs about as fast as, or faster than, hand-tuned code.

Automatic performance tuning works by generating a space of reasonable algorithms and searching it for the best one. The search may be done in part at library build time, at compile time, at run time, or some combination of the above, so this is more than a compiler problem. The approach has had significant successes: Tennessee’s ATLAS package, which is based on our earlier PHiPAC project, produces tuned BLAS (Basic Linear Algebra Subroutines), which are the kernels in the Linpack benchmark and many application codes. It is as fast as or faster than many vendor libraries, and has been adopted as part of the standard Matlab release. MIT’s FFTW produces automatically tuned FFTs, and won the 1999 Wilkinson Prize for Numerical Software. See our web page for other related projects.

The goal of our BeBOP (Berkeley Benchmarking and Optimization) project is to extend automatic tuning to sparse matrix computations. We have made significant progress, described below, with significant speedups on a variety of processors, including the Sun Ultra IIi; Intel Pentium III, Pentium 4, and Itanium; and IBM Power 3.

Sparse-Matrix-Vector Multiplication: y = A*x. This work began in the PhD thesis of Eun-Jin Im under Prof. Yelick. Im produced a prototype system called Sparsity that takes a sparse matrix A at run time and generates a new data structure and a corresponding matrix-vector multiplication subroutine tuned for that matrix on the architecture being used. The simplest optimization is register blocking, in which A is stored as a collection of dense r-by-c blocks for an appropriate r and c. Sparsity combines off-line and run-time search. An off-line “performance profile” of the architecture is obtained by measuring the speed of sparse matrix-vector multiply when a “standard” matrix is stored using all “reasonable” register block sizes (r-by-c, with r and c independently running from 1 to 12). Figure 1 shows the results for the Ultra IIi. Each colored block represents an r-by-c block size, color coded to indicate speed. One can see that 8-by-5 blocking offers a 2.03x speedup over the unblocked, 1-by-1 case. (These profiles look entirely different for different architectures.) Then, at run time, the user’s sparse matrix A is sampled to estimate the fill it would suffer if it were stored in dense r-by-c blocks, i.e., the number of explicit zeros that must be stored (and operated on) in the r-by-c blocks but that would be unnecessary with 1-by-1 blocks. Many application matrices have natural block sizes larger than 1-by-1 that cause no fill at all. Finally, the values of r and c for y = A*x are chosen to minimize the ratio of the estimated fill to the estimated performance from the performance profile, i.e., to maximize estimated performance per unit of fill.
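To make the register-blocking idea concrete, the sketch below shows a 2-by-2 register-blocked sparse matrix-vector multiply and the block-size selection heuristic just described. It is a minimal illustration in C, not Sparsity’s actual code or interface: the block compressed sparse row (BCSR) layout, the array names, and the profile/fill tables are assumptions made for the example.

```c
/* Minimal sketch (not Sparsity's code): y = A*x with A stored in block
 * compressed sparse row (BCSR) format using dense 2-by-2 blocks.  Assumes
 * the number of rows is even (padded otherwise); each block stores its 4
 * values row-major, so the two destination elements stay in registers. */
void bcsr_matvec_2x2(int n_block_rows,     /* number of 2-row block rows    */
                     const int *brow_ptr,  /* start of each block row       */
                     const int *bcol_idx,  /* block column index per block  */
                     const double *bval,   /* 4 values per block, row-major */
                     const double *x, double *y)
{
    for (int bi = 0; bi < n_block_rows; bi++) {
        double y0 = 0.0, y1 = 0.0;                 /* accumulate in registers */
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; b++) {
            const double *a  = &bval[4 * b];
            const double *xp = &x[2 * bcol_idx[b]];
            y0 += a[0] * xp[0] + a[1] * xp[1];
            y1 += a[2] * xp[0] + a[3] * xp[1];
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}

/* Sketch of the block-size selection heuristic: choose the r-by-c pair that
 * minimizes (estimated fill ratio) / (profiled Mflop/s), i.e. maximizes
 * profiled performance per unit of fill.  profile[][] would come from the
 * off-line benchmark and fill_ratio[][] from run-time sampling of the
 * user's matrix; both tables are hypothetical placeholders here. */
void choose_block_size(double profile[13][13], double fill_ratio[13][13],
                       int *best_r, int *best_c)
{
    double best = 0.0;
    *best_r = 1;
    *best_c = 1;
    for (int r = 1; r <= 12; r++)
        for (int c = 1; c <= 12; c++) {
            double score = profile[r][c] / fill_ratio[r][c];
            if (score > best) { best = score; *best_r = r; *best_c = c; }
        }
}
```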

Results for the Ultra IIi are shown in Figure 2. Each column represents a different matrix, from applications ranging from finite element modeling to linear programming to information retrieval. The top black line comes from an analytic performance model that estimates the maximum speed by modeling the memory traffic, since this operation is memory bound. The pink triangles validate our model by using actual memory traffic measured with a hardware performance monitor. The green circles show the performance of Sparsity, which is typically within 20% of the upper bound. The red squares (typically on top of the green circles) show the performance for the optimal r and c chosen by exhaustive search rather than by our cheaper heuristic. The blue asterisks show the performance of the standard unoptimized code. For matrices 1-9, Sparsity is significantly faster than the unoptimized code, and for the remaining matrices both Sparsity and the unoptimized code are relatively close to the upper bound. In summary, register blocking can raise the performance as much as from 45% of optimal to 83% of optimal (for matrix 2), or from 5.2% of machine peak to 9.5% of machine peak (peak is 667 Megaflops on the Ultra IIi). Interestingly, on the 500 Megaflop Pentium III processor, our techniques can increase performance as much as from 8% to 19% of peak.
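To give a feel for how such a memory-traffic upper bound is computed, the following is a simplified sketch, not the exact model behind the black line in Figure 2: it assumes r-by-c BCSR storage with 8-byte values and 4-byte indices, and that each element of x and y crosses the memory bus exactly once.

```c
/* Simplified memory-traffic bound for y = A*x (an illustrative assumption,
 * not the exact model used in Figure 2).  Since the kernel is memory bound,
 * an upper bound on Mflop/s is
 *   (useful flops) / (time to move the required bytes at memory bandwidth). */
double spmv_mflops_upper_bound(long m, long n,
                               long nnz_true,     /* true nonzeros of A        */
                               long nnz_stored,   /* stored entries incl. fill */
                               int r, int c,
                               double mem_bandwidth_bytes_per_s)
{
    long nblocks   = nnz_stored / ((long)r * c);
    double bytes   = 8.0 * nnz_stored       /* block values (doubles)       */
                   + 4.0 * nblocks          /* one column index per block   */
                   + 4.0 * (m / r + 1)      /* block row pointers           */
                   + 8.0 * (n + m);         /* stream x in and y out once   */
    double flops   = 2.0 * nnz_true;        /* one multiply-add per nonzero */
    double seconds = bytes / mem_bandwidth_bytes_per_s;
    return (flops / seconds) / 1e6;         /* Mflop/s */
}
```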

Next steps: Other optimizations for sparse matrix-vector multiplication. There are many other possible optimizations, whose preliminary speedup results we report here: cache blocking (up to 4x speedup on some platforms, up to 1.6x on the Ultra IIi), symmetry (up to 2x, 1.5x on the Ultra IIi), splitting A = A1 + A2 (up to 1.7x, 1.7x on the Ultra IIi), reordering rows and columns (up to 2x, 1.8x on the Ultra IIi), multiple vectors (up to 7x, 4.8x on the Ultra IIi), and diagonal storage (up to 2x, 1.5x on the Ultra IIi).
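The multiple-vectors optimization in the list above is the easiest to illustrate: each matrix entry, once loaded, is reused for all of the right-hand-side vectors. The sketch below is a plain-CSR illustration with a compile-time vector count of 4 and row-major storage of the vectors; the names and layout are assumptions for the example, not our tuned code.

```c
/* Sketch of the multiple-vectors optimization: Y = A*X for a CSR matrix A
 * and K vectors at once.  X and Y store the K entries for a given row
 * contiguously (row-major, leading dimension K), so each loaded matrix
 * entry A(i,j) is reused K times.  Names and layout are illustrative only. */
enum { K = 4 };   /* number of simultaneous vectors (chosen for the example) */

void csr_matvec_multi(int m, const int *row_ptr, const int *col_idx,
                      const double *val, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        double acc[K] = { 0.0 };                       /* K accumulators    */
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            double a         = val[p];                 /* load A(i,j) once  */
            const double *xj = &X[K * col_idx[p]];
            for (int v = 0; v < K; v++)                /* reuse it K times  */
                acc[v] += a * xj[v];
        }
        for (int v = 0; v < K; v++)
            Y[K * i + v] = acc[v];
    }
}
```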

Next steps: Other matrix operations. Going beyond sparse matrix-vector multiplication to other operations offers further speedups: sparse triangular solve (up to 1.8x, 1.8x on the Ultra IIi) and A^T*A*x (up to 3.2x, 2.8x on the Ultra IIi).
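The A^T*A*x kernel gains its speedup from reusing each sparse row of A for both products: compute t = a_i.x and immediately apply y += t*a_i while the row is still in cache or registers, rather than making two separate passes over A. A minimal CSR sketch of this idea (again an illustration with hypothetical names, not our tuned implementation):

```c
/* Sketch of the fused kernel y = A^T*(A*x) for an m-by-n CSR matrix A.
 * Each sparse row a_i is read once and used twice: first for the dot
 * product t = a_i . x, then for the update y += t * a_i. */
void csr_ata_x(int m, int n, const int *row_ptr, const int *col_idx,
               const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)  /* t = a_i . x  */
            t += val[p] * x[col_idx[p]];
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++)  /* y += t * a_i */
            y[col_idx[p]] += t * val[p];
    }
}
```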

Desired Support. Depending on the level of support, we would apply our techniques to the latest Sun Ultra architectures (an Ultra III, perhaps with architectural variants), make our techniques widely available in one or more libraries (such as the new Sparse BLAS standard or PETSc), and extend them to kernels from graphics, cryptography, and other areas.