Using SSE Instruction Set

Implementation of JPEG2000

ECE 734

VLSI Array Structures

For Digital Signal Processing

Ami Mehta

Gilles MullerSpring 2004

Table of Contents

Introduction

1 - JPEG2000 Algorithm

1.1 - Architecture

1.2 – 2D-DWT

2 – Implementation

2-1 – Non optimized version

2.2 – Optimizations

3 – Verification

4 – Profiler

5 – Results

6 – References

Introduction

JPEG2000 is a new image compression standard being developed by the Joint Photographic Experts Group (JPEG), part of the International Organization for Standardization (ISO). It is designed for different types of still images (bi-level, gray-level, color, multicomponent) allowing different imaging models (client/server, real-time transmission, image library archival, limited buffer and bandwidth resources, etc), within a unified system.

JPEG2000 is intended to provide low bit rate operation with rate-distortion and subjective image quality performance superior to existing standards, without sacrificing performance at other points in the rate-distortion spectrum.

JPEG2000 addresses areas where current standards fail to produce the best quality of performance, such as:

Low bit rate compression performance (rates below 0.25 bpp for highly-detailed gray-level images)
Lossless and lossy compression in a single code stream
Seamless quality and resolution scalability, without having to download the entire file. The major benefit is the conservation of bandwidth
Large images: JPEG is restricted to 64k x 64k images (without tiling). JPEG2000 will handle image sizes up to ( 232-1)
Single decompression architecture
Error resilience for transmission in noisy environments, such as wireless and the Internet
Computer generated imagery
Compound documents
Region of Interest coding
Improved compression techniques to accommodate richer content and higher resolutions
Metadata mechanisms for incorporating additional non-image data as part of the file
JPEG2000 will be able to handle up to 256 channels of information, as compared to JPEG, which is limited to only RGB data. Thus, JPEG2000 will be capable of describing complete alternate color models, such as CMYK, and full ICC (International Color Consortium).

1- JPEG2000 Algorithm

1.1 - Architecture

The encoding procedure is as follows:

The source image is decomposed into components.

The image and its components are decomposed into rectangular tiles. The tile-component is the basic unit of the original or reconstructed image.

The wavelet transform is applied on each tile. The tile is decomposed in different resolution levels.

These decomposition levels are made up of subbands of coefficients that describe the frequency characteristics of local areas (rather than across the entire tile-component) of the tilecomponent.

Markers are added in the bitstream to allow error resilience.

The codestream has a main header at the beginning that describes the original image and the various decomposition and coding styles that are used to locate, extract, decode and reconstruct the image with the desired resolution, fidelity, region of interest and other characteristics.

The optional file format describes the meaning of the image and its components in the context of the application.

1.2 – 2D-DWT

This process decomposes the original image into two sub-bands, usually denoted as the coarse scale approximation (lower band) and the detail signal (higher band). This transform can be easily extended to multiple dimensions by using separable filters, i.e. by applying separate 1-D transforms along each dimension. In particular, we have implemented the most common approach, commonly known as the square decomposition. This scheme alternates between operations on rows and columns, i.e. one stage of the 1-D DWT is applied first to the rows of the image and then to the columns.

This process is applied recursively to the quadrant containing the coarse scale approximation in both directions. In this way, the data on which computations are performed is reduced to a quarter in each step.

From a performance point of view, the main bottleneck of this transformation is caused by the vertical filtering (the processing of image columns) or the horizontal one (the processing of image rows), depending on whether we assume a row-major or a column-major layout for the images.

The lifting filter computes the low frequency and high frequency coefficients in 1-D.

The following filter coefficients are used for JPEG2000 lossy filter:

2 – Implementation

2-1 – Non optimized version

We chose to implement the lifting transform using Daubichies 9/7 Analysis. This was done mainly because lifting requires less working memory and fewer arithmetic computations. The lossy filter was implemented since our focus is to make JPEG2000 faster on general purpose architectures, where some loss of resolution is tolerable and lossy compression gives a better compression ratio.

This version was implemented only using C code. The lifting filter described above computes high and low frequency coefficients in the same time, reusing intermediate results. This is a good point as only one filter is needed to compute both coefficients.

The wavelet transform poses a major hurdle for the memory hierarchy, due to the discrepancies between the memory access patterns of the two main components of the 2-D wavelet transform: the vertical and horizontal filtering. Consequently, the improvement in the memory hierarchy use represents the most important challenge of this algorithm from a performance perspective.

As a result, we needed to think about memory use even if this version isn’t optimized for SSE. The goal of this project was to optimize the algorithm implementation for speed. As a result, we chose the strategy to store all intermediate results independently preventing unnecessary false dependences, but thus increasing the amount of memory used.

The Mallat strategy uses an auxiliary matrix to store the results of the horizontal filtering.

In this way, as the following figure shows, the horizontal high and low frequency components are not interleaved in memory. This method prevents memory scattering and enable to exploit better SIMD parallelism.

The vertical filtering reads these components and writes the results into the original matrix following the order expected by the quantization step.

For our implementation we did the down sampling before calculations, to ensure none of our calculations were wasted and also reused calculations across computations whenever possible.

2.2 – Optimizations

The analysis of this C code using VTune Analyzer showed that the major bottleneck of our code was the memory access, and particularly a low hit rate in the L1 data cache. To solve this problem, the following optimizations are used:

Data alignment

Strictly speaking, data alignment is not required in our codes since the SSE instruction set includes instructions that allow unaligned data to be copied into and out of the vector registers. However, such operations are much slower than aligned accesses, which may cause a significant overhead. To avoid this drawback we have employed 16-byte aligned data in all our codes, although for the scalar versions this optimization has no significant effect.

Matrix juxtaposition

Input and output matrices are juxtaposed in the memory to prevent associativity conflicts. Indeed, after juxtaposition, the 2 matrices can be loaded entirely in the L1 data cache without having to switch between one matrix to another.

Here is a part of our code showing these optimizations:

/* Allocate memory to the 2 matrices */

/* The 2 matrices are juxtaposed and aligned on the cache row size:128 bits=16 Bytes */

inline void load_matrix(float **matrix_in, int row, int column, float **matrix_out){

unsigned long int size;

size=2*row*column*sizeof(float);

if( (*matrix_in=(float*)malloc(size+16)) == NULL ){

printf("Not enough memory available.\n");

exit(1);

}

//align the address at the beginning of a cache line

*matrix_in=(float*)((((int)(*matrix_in))+16)-((((int)(*matrix_in))+16)%16));

*matrix_out=(float*)((*matrix_in)+row*column);

}

Once these were done we transformed the code using SSE2. This was done to make use of the inherent parallelism in the code.SSE does four calculations at a time using special registers allocated for SSE. We used these 128 bit floating point registers to store four64 bit values. Then these computations were done in parallel. We used intrinsics for this part. The following piece of code is typical of our use of SSE intrinsics.

m1 =_mm_add_ps(*pscr2,*pscr0);

m2 = _mm_mul_ps(m0_a0, m1);

m3 = _mm_add_ps(*pscr1,m2);

// high1[i]=start1[i]+a0*(start2[i] + start0[i]);

m1, m2, m3, pscr2, m0_a0, etc are all 64 bit values, stored in groups of four as mentioned before. The following figure shows how the 128 bit registers are used for four calculations to take advantage of parallelism.

3 –Verification

We verified our code using MATLAB. We took an image and converted it to a matrix. We used this matrix as an input to our DWT code. Then the output matrix of our code was dumped into a file. We took this output matrix and normalized it by subtracting the smallest number and dividing by the greatest number and displayed it is an image. The following results were produced.

Original Image Matlab output for DWT

After row transform using our code After DWT, using our code

The results from our code match those of MATLAB which shows our code works.

4 –Profiler

As mentioned before, we used VTune Analyzer to profile our results. The VTune™ Performance Analyzer helps you analyze the performance of your application by locating hotspots. Hotspots are areas in your code that take a long time to execute. Not only that it also helps you find out what is causing these hotspots and decide what type of improvements to make. You can also track critical function calls and monitor specific processor events, such as cache misses, triggered by sections in your code. You can also calculate event ratios to determine if processor events are causing the hotspots.

The VTune analyzer collects performance data on your application and system, and displays it in graphs or tables. From this display, you can analyze the performance of your application and determine which portions of an application are slowest and why.

To optimize the performance of your application or system, you can do one or more of the following to find the performance bottlenecks:

Determine how your system resources, such as memory and processor, are being utilized to identify system-level bottlenecks.
Measure the execution time for each module and function in your application.
Determine how the various modules running on your system affect the performance of each other.
Identify the most time-consuming function calls and call sequences within your application.
Determine how your application is executing at the processor level to identify microarchitecture-level performance problems.

The VTune™ Performance Analyzer can find this information by automating the process of data collection with three types of data collectors, namely, sampling, call graph, and counter monitor.

Thus we decided to use VTune Analyzer. With the help of profiling we were able to identify that memory access was the biggest bottle neck for our code.

5 – Results

We conducted a set of simulation using the pre-optimized and post optimized versions of our code. The simulations were done on Pentium Celeron processor 2.4 GHz, with a 8K uops L1 trace cache and a 128 Kb L2 cache. We did a number of simulations, since profiling take samples to get a good estimate we ran a set of three simulations on images of size 256 x 256, 512 x 512 and 1024 x 1024. The results for the biggest image are shown here.

A set of three simulations were done on a 1024 x 1024 image. The results for the improvement in cycles per uops are given below.

As can be seen on an average there is an improvement of 44.61% was seen during optimization. The results for 1st level cache load misses are shown below.

As can be seen we dramatically reduced 1st level cache load misses by the various optimization techniques. The misses reduced on an average by 74%, from about a little below 5,000,000 to a little over 1,000,000. The second level cache improvement is as shown below.

There is an improvement of 53% on an average.

Thus it can be shown that our cache optimizations and using SSE code to exploit parallelism worked well causing an improvement on performance and cache access.

6 – References

[1] Taubman, David and Marcellin, Michael, JPEG 2000, Image compression Fundamentals, Standards and practice.

[2] Michael D. Adams, The JPEG-2000 Still Image Compression Standard.

[3] C. Tenllado, D. Chaver, L. Piñuel, M. Prieto and F. Tirado, Vectorization of the 2D Wavelet LiftingTransform using SIMD extensions.

[4] D. Chaver, C. Tenllado, L. Piñuel, M. Prieto and F. Tirado, 2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMDParallelism Exploitation.

[5] IA 32 Intel Software Developers Manual, Volume 2: Instruction Reference.